We-the-People-civ4col-mod / Mod

This is the repository where the mod resides.
90 stars 37 forks source link

Clean up XML loading #926

Open Nightinggale opened 1 year ago

Nightinggale commented 1 year ago

We still largely have the vanilla approach to loading xml files. We (mainly I) have made improvements to combat a number of the vanilla issues, but we still mostly have vanilla code with vanilla issues.

The vanilla approach is to load the xml files in a semi random order and store the data in vectors of info pointers. To reach the main menu faster, some xml files aren't loaded until after a new game is started/loaded.

We now have a function, which lists all the xml files and where they should be stored. This is then called multiple times:

  1. Load lengths of xml files (technically done elsewhere to do it prior to CvGlobals constructor)
  2. Call CvInfoBase::read (each instance)
  3. Call GC.postXMLLoad(first)
  4. Call childClass::read (each instance)
  5. Call childClass::postLoadSetup (each instance if child class has this function)
  6. Call GC.postXMLLoad(not first)

Doing all those steps makes it harder to introduce bugs due to load order. For instance all strings are read at step 2, but cross connecting entries based on strings happens in step 4 meaning a file can use string types from a yet to be loaded xml file (unlike vanilla).


Problems/possible solutions

readpass2 and readpass3

Since vanilla had to handle reading types for future entries, two functions were added for that purpose. The strings are then stored in the class and when it is called again, those strings can then be resolved to int values.

This is very unorganized and wasteful. No need to store all those strings when the strings are available when the current code sets up string storage for readpass2+3. We should get rid of those two functions

XML access is a simple pointer

Reading is done by getting a CvXMLLoadUtility pointer from the exe. This accesses the xml file through the tinyxml2 library, which is compiled into the exe. This grants low level xml access, but there is no smart logic when reading xml files. You mostly just have the current xml tag and that's it. To make this worse, the exe doesn't even expose the full tinyxml2 function set, making navigation of the xml file harder. We kind of have to hope the xml modder did it correctly. Also the few errors vanilla can provide tells absolutely nothing. It can be on the level where something went wrong, but it doesn't say what and the reported file might be the offending one, but not always.

It would be nice to have a class as a wrapper to the CvXMLLoadUtility pointer. One, which would be able to keep track of where in the file it is, report proper error messages, and generally have an easier API to load the individual tags. In fact we can go one step further and ditch the exe entirely now that we have tinyxml2 in the DLL anyway. That would require us to not use the exe to read the file in the mod and use vanilla as fallback. However the DLL has the path to the DLL file and the exe file, so we can construct both paths and make our own fallback file locator.

Comments in xml can crash the DLL

do
{
    SkipToNextVal();
} while (gDLL->getXMLIFace()->NextSibling(m_pFXml));

Vanilla uses code like this. The problem is that NextSibling can be a comment. This is fixed with SkipToNextVal, which passes any comment nodes. However if there is only comment nodes left, it will move forward until the node pointer is NULL. If it tries to use the NULL pointer to read, it crashes.

We need some sort of NextSiblingNoComment or better yet: change the entire approach.

Nested xml tags are a pain

if ( gDLL->getXMLIFace()->SetToChildByTagName(pXML->GetXML(),"TagName") )
{
    // some code
    gDLL->getXMLIFace()->SetToParent(pXML->GetXML());
}

There is currently no clean way of walking through the xml layout. The programmer has to remember to move up and down between parents and children. This can be a huge pain, particularly in nested structures where cleanup means going to parent multiple times.

The result of how crude this system works has made modders (and vanilla developers?) add most data to the root of each entry, making it very unorganized.

if ( gDLL->getXMLIFace()->SetToChildByTagName(pXML->GetXML(),"TagName") )
{
    Cleaner clean(pXML); // can be first line
}

This would be one approach. Cleaner (or whatever it should be called) can be set right after changing level and it returns to parent in the deconstructor. This would be particularly useful with nested read functions, which contains returns.

However tinyxml2 doesn't rely exclusively on a file pointer. Navigating the tags in a file is done by node pointers. This means it can be possible to request a pointer to the child node while preserving the pointer to the parent. That way no cleanup will be needed.

Reading enum values

pXML->GetChildXmlValByName(szTextVal, "Special");
m_iSpecialUnitType = pXML->FindInInfoClass(szTextVal);

This is vanilla code. It's crude in the sense that there is no check to see if the read string is of type SpecialUnitTypes. I can enter BONUS_TIMBER, which is at index 0 in the bonus info file, hence int m_iSpecialUnitType becomes 0 (SPECIALUNIT_YIELD_CARGO). Vanilla accepts this without any warning or error. Also it encourages storing enum values as ints in the info classes as that avoids the need to typecast.

I added pXML->GetEnum to make it a one line read and lock it to a specific enum type. However That function is from the early days of WTP and as such isn't using features added later (all the perl script output), so it's kind of crude in arguments and internal workings.

Ideally we will get a new function to do this, which works better internally and takes less arguments because it's a member function of a class instance, which stores some of the arguments between calls.

Reading lists

pXML->SetVariableListTagPair(&m_abFreePromotions, "FreePromotions", GC.getNumPromotionInfos(), false);

This code is bad on many levels. First of all, if GC.getNumPromotionInfos() has a mismatch with the number of child nodes in xml, then the loading code will trigger an unrecoverable error, but there are other issues related to how the xml file is written. It is a very fragile function. Furthermore the result ends up in a C style array, in this case bool*. There is no logic to ensure that it's allocated, read arguments are within range or anything like that. It also needs custom code to free the memory should the instance be unallocated.

Solution: use EnumMap or InfoArray. Both are designed to not only be more memory efficient than this, but also be type safe, range check, freeing memory in deconstructor etc.

Reading into vectors

if (gDLL->getXMLIFace()->SetToChildByTagName(pXML->GetXML(),"YieldsConsumed"))
    {
        if (pXML->SkipToNextVal())
        {
            int iNumSibs = gDLL->getXMLIFace()->GetNumChildren(pXML->GetXML());
            m_aiYieldsConsumed.clear();
            if (0 < iNumSibs)
            {
                if (pXML->GetChildXmlVal(szTextVal))
                {
                    for (int j = 0; j < iNumSibs; j++)
                    {
                        m_aiYieldsConsumed.push_back(pXML->FindInInfoClass(szTextVal));
                        if (!pXML->GetNextXmlVal(&szTextVal))
                        {
                            break;
                        }
                    }
                    gDLL->getXMLIFace()->SetToParent(pXML->GetXML());
                }
            }
        }
        gDLL->getXMLIFace()->SetToParent(pXML->GetXML());
    }

Super complex way to get a vector, which does the same as an InfoArray, except only InfoArrays are type safe. Also InfoArrays can handle more complex structures where each entry has 1-4 values, not just one.

Current code reading an InfoArray:

readXML(m_info_YieldDemands, "YieldDemands");

CvInfoBase can likely be reduced significantly in size

This class has a bunch of data (primarily strings) for just in case a child class needs it. We should get rid of as much as possible as it bloats the xml storage memory usage. Also it's full of virtual functions, which aren't used as virtual because they are called using a child instance. The only virtual function actually being used as virtual is the deconstructor and if we switch from vectors of instance pointers to vectors of instances, then the deconstructor will not need to be virtual either.

The main reason why this class remains mostly untouched is that it's unknown how much the exe relies on anything in this class, like does it use the virtual function pointers? It sounds like poor coding style to rely on virtual functions in an EXE<->DLL interface, but it seems that this is actually the case in CvPlayer. Just removing the virtual keywords will make the exe do something, which crashes. That in itself is pending further investigation.

sengir commented 12 months ago

I haven't tested if this holds for Civ4Col, but Civ4BTS has an unfortunate limitation on CvInfo*classes. Firaxis's habit of sharing headers between the EXE and the DLL usually just means that we need to provide the expected DllExport'd functions and avoid mucking up a handful of virtual method tables. However, they accidentally depended on a rather a crucial implementation detail in CvInfos.h: object size and member offsets.

This is because the EXE freely casts between CvInfo* pointers. This is normally fine, but some CvInfo* classes use multiple inheritance (e.g. CvArtInfoScalableAsset) and casting to/from the second base class involves modifying the pointer value by some comptime-known offset. This offset is hardcoded into the EXE.

What this means for us is that we can't change the size of any base class that directly or indirectly gets included in the first base class of a CvInfo* class with multiple base classes. Or more accurately, we need to maintain the offsets of base classes.

Nightinggale commented 12 months ago

We should likely follow #937 to make wrapper classes for the exe to use in case it relies on memory offsets. That will make the code more stable in general.

What this means for us is that we can't change the size of any base class that directly or indirectly gets included in the first base class of a CvInfo* class with multiple base classes. Or more accurately, we need to maintain the offsets of base classes.

That's precisely the reason why I haven't dared to clean up unused virtual functions in the base class.

Nightinggale commented 11 months ago

One thing, which has been mentioned in chat, but not here is xml schemas. Tinyxml2 is a good lightweight xml interface, which is easy to work with, but in order to keep it that way, the developers have intentionally made it ignore schema files.

Another issue with schema files is that vanilla uses a dead format. Today W3C schemas are the standard. Switching will allow us to use third party xml editors.

We would have to have our own schema validation code, but we would have to make that for either schema file system. In order to keep it simple and lightweight, we can use a subset of the standard. We don't actually need much of the schema features and even with a subset, external tools like xml editors will still work.

<xs:element name="car" type="carType"/>

<xs:simpleType name="carType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Audi"/>
    <xs:enumeration value="Golf"/>
    <xs:enumeration value="BMW"/>
  </xs:restriction>
</xs:simpleType> 

Copy pasted this example. We can make one of those for each xml file, like say one for UnitTypes. Since the schema files support importing other schema files, we can make one common autogenerated schema file, which then tells editors valid values for UnitTypes in all schema files, hence all our xml files.

Our DLL will not have to support importing or understanding this layout. We can make it trigger on type="UnitTypes" and then our code will know which array of xml info where it should look for the string in question.

We can add a tag into xml files, which essentially means "continues in this file". That way we can split huge xml files into smaller ones. Since it's a tag in the xml file itself, it will be set in a single location and both dll, enum building script and xml schema building script will have a single location to look for, hence they will stick to the same order.