Excel class API changes

collinsmith commented 5 years ago

I want to look into changing some functionality of Excel and improving it. Some of these are tentative

[x] remove reliance on TXT/optimize -- TXT isn't used directly anywhere, so I want to look into if any optimizations can be made in the excel loading if I don't have to go through TXT first, however it might be the case that TXT can be interchangeable with BIN when that is added
[x] add support for BIN files which can be parsed faster than TXT and contain a subset of columns (namely removing the metadata columns not needed for a release version)
[ ] foreign key support -- flag columns as foreign keys which can be assigned at a later time (second pass of the parser once the other tables have loaded). This is an extremely common use-case, so it might improve readability if this were the case that I could simply replace these columns with references to the corresponding data, instead of the index of that data. I want to avoid tightly coupling the tables with Riiablo.files if I can though.
[ ] transition from cascading if-statements to adding a map indexed on type with parsers for the excel columns. I think this will improve speed and extensibility, but I also want to avoid auto-boxing the data wherever possible
[ ] it might be nice to support parsing data structures, e.g., some excel files contain multiple numbered columns (stat1, param1, stat2, param2), but I think it might be nice if this could be parseable into stat[] object consisting of a stat, param instead of stat[], param[], this just keeps associated data together
[ ] related to above point, it would be nice to support multi-dimensional arrays, or at least 2D. E.g., some stats have copies for each difficulty, and partial set bonuses would be easier to implement in a 2D array.

This might be doable when some #8 improvements on loading TXT files are made.

collinsmith commented 5 years ago

Another good use-case for 2D arrays -- quests.txt has 3 indexes for each qsts.

collinsmith commented 4 years ago

I spent some time looking into how much of this issue is related with #8. I wanted to find specific reasons why loading monstats and monstats2 took so long on mobile (an upwards of 5 seconds). I thought originally this was due to how Excel uses reflection to assign the values, and it looks like it's actually an issue with how integers (and other values) are parsed. Unfortunately there doesn't seem to be a trick to perform this any faster, as I should be using the fastest method available in Java. For reference, this is done 163x733 in monstats for 119479 integer parses.

Due to this behavior, I think I'm coming to the same conclusion that original game did, and I'm going to not rely on the excel files during runtime and generate packed bin files instead. The game files already contain pre-generated bin files for some of the important excels, so perhaps those can be leveraged, however not all of the columns are obvious (many strings are converted to some int identifier), so it might be easier to just generate my own bin files the first time the application is run (or if a certain arg is present, similar to -txt in original game client). I assume this wouldn't add any significant performance hit to the current mobile client, however subsequent launches should show a significant performance boost.

For TXT parsing later: I think it's plausible to perform some optimization on TXT to read the incoming data as a number (verses reading everything as strings and then parsing as ints later), but I think this performance boost would be negligible.

collinsmith commented 4 years ago

I started writing my own (de)serializers for monstats, and preliminary tests look extremely promising. Down to 225ms from 4669ms or about 95% reduction. It seems pretty obvious this is the way to go, and there are still some fairly significant optimizations that can be made in terms of excluding unneeded columns and storing certain strings as index references (e.g., string table references can be saved as the actual string table index rather than reference string).

The plan I'm working with right now is to generate my own bin files for (at least some) txt excel files. I'm still working out the generation process and figuring out if this should be supported by all files*. I wrote a code generation tool to perform most of the heavy lifting on monstats and any other excel file (spits out the read/write code to be copy/pasted into the source code), however I still need to test equality of txt excel objects versus bin excel objects to make sure that the serialization process is performed correctly (basic tests look fine though).

*The majority of the slowdown is due to a few specific files, so I want to see the effect of fixing those first, but eventually I'll target all files.

collinsmith commented 4 years ago

I'd like to push some of the (de)serialization responsibilities up to Excel from the subclasses, as obviously an int representing the number of rows of the table should always come first. Additionally, MonStats deserialization is taking care of the indexing requirements that should also probably be handled by Excel. The deserialization process should index exactly how the txt tables do, but I'm concerned that if Excel handles it, it could be really slow if I need to access the txt table at all, which doesn't necessarily indicate the order of which the columns are serialized. One possible solution would be to guarantee the first sequence of bytes are always the index (String or int)

I'm going to look into using a gradle task to generate the (de)serializers as a separate class in gen/, however if I want to optimize for serialization, I need to check on if customization is needed or if I can store the encoding formats as metadata. Most if not all strings are really references to internal structs (monster id, index in the i18n string table, treasure class), but each of these requires decoding internally and may rely on other tables. I should probably explain that I don't want to do this by hand for every excel -- so I'll look into no index optimization first using all generated sources.

I've decided to forgo the custom (de)serializers for now on indexed columns -- just seems like a ton of work when my current improvements get me the most bang -- I can always write custom ones where needed. Binning only monstats and monstats2 caused the time to load all excel files in the game to go from 13657ms to 6775ms on my android device, so I'm hoping this is representative of the improvements to the other files and I can get the total time down to about a second. (monstats 4356ms -> 218ms, monstats2 1695ms -> 90ms)

I have a workable version of a bin serializer generation tool which creates generated sources that read and write the excel data to binary, but I'd like to investigate improvements to the underlying design as it uses a ton of reflection right now that I'd like to look into alternatives for. If it works out, this should cut down on maintenance (re-generate serializer when change is made to excel entry class) and seriously speed up the work on getting the serializers written (there are 50+ entry classes).

Tests on the serialized structs have still not been created, but I haven't noticed any impact on the existing behavior. Integrating support for indexing the primary key was actually pretty straightforward and I was able to use the existing code from the txt loader.

This is the design I'm looking into. I think this should be sufficient for my needs and will support changing the serializers to custom ones when they are needed later on.

public @interface Binned {
  Class<? extends Serializer> value();
}

public interface Serializer<T extends Excel.Entry> {
  void readBin(T entry, DataInput in) throws IOException;
  void writeBin(T entry, DataOutput out) throws IOException;
}

public class MonStatsSerializer implements Serializer<MonStats.Entry> {
  @Override public void readBin(MonStats.Entry entry, DataInput in) {}
  @Override public void writeBin(MonStats.Entry entry, DataOutput out) {}
}

@Binned(MonStatsSerializer.class)
public class MonStats {
  public static class Entry extends Excel.Entry {
    public short   hcIdx;
    public short   BaseId;
    //...
  }
}

I could also adjust slightly to not use annotations and instead redefine Excel as:

public class Excel<T extends Excel.Entry, U extends Serializer<T>> implements Iterable<T> {}

public class MonStats extends Excel<MonStats.Entry, MonStatsSerializer> {}

The first choice is easier to prototype as it doesn't require changing 50+ files just to test. The second choice is how it should probably be. Generated serializers are not ideal, but they should work until this whole system can be looked at and improved later on.

collinsmith commented 4 years ago

I rewrote the excel system under com.riiablo.excel, but this ended up being more work than I'd like to do at once. I'm going look into retrofitting some of the changes I made onto com.riiablo.codec.excel, but I think I'm going to go with the reflection and annotation approach I initially mentioned above and when I want to come back and improve this system later, implement the much better approach I began in com.riiablo.excel. Additionally, support for serializing some specific excels needs to be looked into, some classes like Runes and BodyLocs do some re-indexing that should be specifically tested.

This is definitely the preferred way to go and removes all of the dirty reflection techniques.

com.riiablo.excel spec

```java public abstract class Excel> implements Iterable { public static , T extends Excel> T load(T excel, FileHandle txt) { return load(excel, txt, null); } public static , T extends Excel> T load(T excel, FileHandle txt, FileHandle bin) {...} } public interface Serializer { void readBin(T entry, DataInput in) throws IOException; void writeBin(T entry, DataOutput out) throws IOException; } public class MonStats extends Excel { public MonStats() { super(Entry.class); } @Override public Entry newEntry() { return new Entry(); } @Override public Serializer newSerializer() { return new Serializer(); } public static class Entry extends Excel.Entry {} public static class Serializer implements com.riiablo.excel.Serializer { @Override public void readBin(Entry entry, DataInput in) {} @Override public void writeBin(Entry entry, DataOutput out) {} } } FileHandle txt = ...; FileHandle bin = ...; MonStats excel = Excel.load(new MonStats(), txt, bin); ```

Wrote some tests for monstats, runes and bodylocs, and everything looks good as far as excel entries -- need to test lookup tables though.

I incorporated the serialization and ran a test on android -- 19 seconds to 1.4 seconds.

collinsmith commented 4 years ago

I rewrote TXT as TxtParser to be a wrapper for a BufferedReader. This gives slightly better performance than TXT (on mobile) and removes much of the heavy-weightedness. Excels are now parsed line-by-line with EXPANSION row being automatically ignored -- this change results in only the actively read row's data being accessible by Excel.loadTxt (which is fine -- TXT was written to fit this use-case, but it isn't needed anymore).

collinsmith commented 3 years ago

I've been rewriting this module using techniques I've learned since creating it, since tons of the info in this is outdated and deprecated. The big improvement I've made is while rewriting TxtParser, I am using more byte-oriented data structures and readers and enforcing ASCII characters. No more expensive readLine() or String#split operations -- instead the InputStream is read until certain bytes are found (\t, \r, \n, -1), caching their locations and then the data can be parsed in-line using this buffer directly. Strings are returned as AsciiString references to the buffer and String is only generated when needed.

Exception handling is significantly more robust and error reporting far improved. I'm moving forward with final tests and working on the BinGenerator so I can merge and close this issue. The next iteration to use annotation processing will likely happen at a later time.

collinsmith commented 3 years ago

New design is being moved forward with and this issue has been rescoped within the context of #135

Closing this issue as it's being effectively handled by #135

collinsmith / riiablo

Excel class API changes #26