Larger xml file takes very long to decode

CoreOffice / XMLCoder

Easy XML parsing using Codable protocols in Swift

https://coreoffice.github.io/XMLCoder/

MIT License

795 stars 107 forks source link

Larger xml file takes very long to decode #198

Open MartinP7r opened 4 years ago

MartinP7r commented 4 years ago

Hello there and thanks for providing this great package!

I have a 108mb (language dictionary with just under 5 million rows) xml file that takes over an hour to decode. At peak the memory consumption is over 2GB (according to debug navigator).

It will finish decoding eventually (I actually did let it run for an hour once). Smaller sections of the file decode just fine.

I wonder if you've experienced something similar and if there's any easy solution.
I'm wondering if splitting up the original file into smaller chunks. Which would maybe help if you're parsing the xml as DOM under the hood, but probably not so much if it's SAX?

MaxDesiatov commented 4 years ago

Hi @MartinP7r, thank you for reporting this.

I think that splitting the file is the best way to make it faster short term. Also, if your XML has deep nesting, that may slow it down too. I'd be very interested to see profiling results if you can build in Xcode and profile it with Instruments.

We use the standard XMLParserDelegate type from Foundation under the hood. Profiling would help in determining how much time is spent in Foundation's code as opposed to the XMLCoder code and would allows us to decide what exactly should be optimized at first.

MartinP7r commented 4 years ago

Thank you for your quick response!
I'll try and find out where all that time is spent. Probably towards the weekend, since it's a personal project I'm working on.
I'm not too experienced with Instruments, so thanks for the link! If you have any suggestions what else too look for, please let me know.

MartinP7r commented 4 years ago

I've let it run for a while

1.68 min  100.0%    0 s     Japanese (1117)
1.68 min   99.9%    0 s      Main Thread  0x58097
28.67 s   28.4% 28.67 s       specialized _ArrayBufferProtocol._arrayOutOfPlaceUpdate(_:_:_:_:)
24.42 s   24.2% 24.42 s       XMLCoderElement.transformToBoxTree()
12.74 s   12.6% 0 s       XMLStackParser.parse(with:errorContextLength:shouldProcessNamespaces:)
12.74 s   12.6% 0 s        static XMLStackParser.parse(with:errorContextLength:shouldProcessNamespaces:trimValueWhitespaces:)
12.74 s   12.6% 0 s         XMLDecoder.decode<A>(_:from:)
8.72 s    8.6%  8.72 s        XMLStackParser.process(string:)
8.72 s    8.6%  0 s        specialized XMLStackParser.parser(_:foundCharacters:)
6.80 s    6.7%  6.80 s        @objc XMLStackParser.parser(_:didStartElement:namespaceURI:qualifiedName:attributes:)
6.67 s    6.6%  0 s        XMLStackParser.parse(with:errorContextLength:shouldProcessNamespaces:)
6.67 s    6.6%  0 s         static XMLStackParser.parse(with:errorContextLength:shouldProcessNamespaces:trimValueWhitespaces:)
6.67 s    6.6%  0 s          XMLDecoder.decode<A>(_:from:)
114.00 ms    0.1%   0 s        @objc XMLStackParser.parser(_:didStartElement:namespaceURI:qualifiedName:attributes:)
114.00 ms    0.1%   0 s         XMLStackParser.parse(with:errorContextLength:shouldProcessNamespaces:)
114.00 ms    0.1%   0 s          static XMLStackParser.parse(with:errorContextLength:shouldProcessNamespaces:trimValueWhitespaces:)
114.00 ms    0.1%   0 s           XMLDecoder.decode<A>(_:from:)
10.00 ms    0.0%    0 s        XMLStackParser.stack.modify
4.05 s    4.0%  4.05 s        specialized XMLStackParser.parser(_:foundCharacters:)
1.80 s    1.7%  1.80 s        @objc XMLStackParser.parser(_:foundCharacters:)
1.17 s    1.1%  1.17 s        specialized UnsafeMutablePointer.initialize(from:count:)
1.11 s    1.1%  1.11 s        specialized _ContiguousArrayBuffer.init(_uninitializedCount:minimumCapacity:)
1.10 s    1.0%  0 s       static String.+ infix(_:_:)

this is with hidden system libraries, I think Foundation related information is hidden by this? Which makes it somewhat useless. 😅

Not hiding them makes everythin really hard to read.

1.68 min   99.9%    0 s      Main Thread  0x58097
21.92 s   21.7% 21.92 s       swift_release
21.51 s   21.3% 21.51 s       swift_retain
4.59 s    4.5%  4.59 s        swift::metadataimpl::ValueWitnesses<swift::metadataimpl::OpaqueExistentialBox<1u> >::initializeWithCopy(swift::OpaqueValue*, swift::OpaqueValue*, swift::TargetMetadata<swift::InProcess> const*)
2.46 s    2.4%  2.46 s        objc_msgSend
2.43 s    2.4%  2.43 s        _CFStringCheckAndGetCharacterAtIndex
2.04 s    2.0%  2.04 s        initializeWithCopy for Character
1.69 s    1.6%  1.69 s        _platform_memmove
1.60 s    1.5%  1.60 s        _CFStringGetCStringPtrInternal
1.54 s    1.5%  1.54 s        _CFRetain
1.47 s    1.4%  1.47 s        _CFRelease
1.38 s    1.3%  1.38 s        __CFStringCreateImmutableFunnel3
1.29 s    1.2%  1.29 s        _kernelrpc_mach_vm_deallocate_trap
1.18 s    1.1%  1.18 s        swift::OpaqueValue* tuple_initializeWithCopy<false, false>(swift::OpaqueValue*, swift::OpaqueValue*, swift::TargetMetadata<swift::InProcess> const*)
1.17 s    1.1%  1.17 s        tiny_malloc_from_free_list
1.12 s    1.1%  1.12 s        void tuple_destroy<false, false>(swift::OpaqueValue*, swift::TargetMetadata<swift::InProcess> const*)

If the .trace file for instruments would help, I can share it via google drive.

as screenshots: hidden sys libs: Instruments

with sys libs: Instruments_and_cyrus_cyrus-mac____code_ios_projects_JapaneseDictionary_Japanese

MaxDesiatov commented 4 years ago

Thank you, that's super helpful! I think the last screenshot shows that the majority of time is spent in the JMdict.retrieveDict function, which then calls into SettingsDictRow.body.getter. I'd recommend looking into the source code of those functions, and maybe profiling them separately from XMLCoder? Could it be the case that you repeatedly call them, and those functions run some heavy code over and over again? Or maybe there's some blocking file I/O they do that could be done asynchronously? Looks like they are calling into SwiftUI and its gesture handling functions, maybe there's some SwiftUI state that you capture and inadvertently update?

I'd also recommend profiling without any UI involved, especially without the SwiftUI bits that we can see in the profiler screenshot. I hope that the code calling into XMLCoder can be isolated into a self-contained command-line app or test case executed with XCTest to be profiled separately?

Either way, it seems like the hot path is somewhere there, but not in XMLCoder. I may be wrong, but I can't verify it without looking at the actual code unfortunately.

MartinP7r commented 4 years ago

I think you might have it backwards since in this view the call tree is inverted. Sorry I did not mention that, I used the settings as described in the tutorial you recommended above:

I'd be happy to isolate the profiling from UI logic to make sure. I'll try and write a test on Wednesday.

MartinP7r commented 4 years ago

Btw, all the retrieveDict() method does is:

    static func retrieveDict() -> JMdict? {
        guard let url = Bundle.main.url(forResource: "JMdict",
                                        withExtension: "xml") else { fatalError() }
        guard let data = try? Data(contentsOf: url) else { fatalError() }

        let decoder = XMLDecoder()
        let jmdict: JMdict?

        do {
            jmdict = try decoder.decode(JMdict.self, from: data)
        } catch {
            l.error(error)
            jmdict = nil
        }

        return jmdict
    }

Looks like they are calling into SwiftUI

Just to clarify again, the call tree is inverted, so Swift UI is calling retrieveDict, once only.

MaxDesiatov commented 4 years ago

I think the logic in KeyedStorage looks like the hottest path here, and it's needed to make sure that order of elements is preserved when encoding. Maybe we could could optimize parts of it, or maybe make it configurable to allow turning it off, but this would require some time to develop and to make sure that existing test cases aren't broken.

Do you plan to release your project or at least parts of it as open source? As far as I understand, the JMdict file is freely available, would you be able to share the JMdict model structure you use here? This would allow me to create a test case or an isolated app to profile it on my side.

MartinP7r commented 4 years ago

Thank you and sorry for the late reply.

needed to make sure that order of elements is preserved when encoding.

I'm not sure I fully understand. How does this come into play when trying to decode, not encode?

I can share what I have so far in terms of models. It's still incomplete and doesn't cover all of the available tags yet, though.

MartinP7r commented 4 years ago

struct JMdict: Codable {

    var entries: [Entry]

    enum CodingKeys: String, CodingKey {
        case entries = "entry"
    }
}

struct Entry: Codable {

    var ent_seq: Int = 0
    var kanjis: [Kanji]? = []
    var readings: [Reading] = []
    var senses: [Sense]

    enum CodingKeys: String, CodingKey { case
        ent_seq = "ent_seq",
        kanjis = "k_ele",
        readings = "r_ele",
        senses = "sense"
    }
}

struct Kanji: Codable, Hashable {
    var expression: String
    let info: String?
//    let priority: [String]

    enum CodingKeys: String, CodingKey { case
        expression = "keb",
        info = "ke_inf"
    }
}

struct Reading: Codable, Hashable {
    let expression: String
    let restrictedTo: String?
    let info: String?
    //    let priority: [String]

    enum CodingKeys: String, CodingKey { case
        expression = "reb",
        restrictedTo = "re_restr",
        info = "re_inf"

    }
}

struct Sense: Codable, Hashable {

//    var stagk: String?
    let partOfSpeech: String?   
    let crossReference: String?
    let antonym: String?
    let field: String?
    let misc: String?           
    let languageSource: String? 
    let dialect: String?
    var glosses: [Gloss]

    enum CodingKeys: String, CodingKey { case
//        stagk,
        partOfSpeech = "pos", 
        crossReference = "xref",
        antonym = "ant",
        field,
        misc,
        languageSource = "lsource",
        dialect = "dial",
        glosses = "gloss"
    }
}

struct Gloss: Codable, Hashable {

    var value: String
    var type: String? 
    var lang: String = "eng"
    var gender: String?

    enum CodingKeys: String, CodingKey { case
        value = "",
        type = "g_type",
        lang = "xml:lang",
        gender = "g_gend"
    }
}

MaxDesiatov commented 4 years ago

I'm not sure I fully understand. How does this come into play when trying to decode, not encode?

I think the most direct way to get exposed to that is to decode into enums with something like XHTML and other similar types of XML documents:

<body>
  <p>Paragraph 1</p>
  <p>Paragraph 2</p>
  <span>Random text</span>
  <p>Paragraph 3</p>
<body>

A super naive way to decode this would be:

struct Body: Codable {
  let p: [String]
  let span: [String]
}

But with this model you'd never know that the <span> element comes after the first two <p> elements and before the third <p> element.

Then a more suitable model would be:

struct Body: Codable {
  enum Element: Codable {
    case p(String)
    case span(String)
  }

  let elements: [Element]

  // more special enum handling omitted for brevity
}

XMLCoder decodes (and symmetrically encodes) in two stages:

It builds a tree of boxes based on their types discovered dynamically from the result of Foundation's XMLParser. For collections of elements like in the example above it needs to preserve the order, but also preserve indices of corresponding keys to handle both types of models. At this stage we can't know whether you model type expects the order of elements to be preserved, so we preserve it just in case.
It maps the tree of boxes by mapping it to the tree of coders, supplying the subtrees to corresponding Decodable implementations of each model type. Again, from the perspective of XMLCoder the Decodable implementations of your model types is a black box. It can be a standard compiler-generated boilerplate, but can also be your own highly-customized logic.

Because so much happens in runtime, and Swift doesn't allows us to examine your model types at compile time (we'd probably need something like proper macros for that, and not C/C++ macros, but something closer to what Rust has), there doesn't seem to be a way to optimize this other than a user toggling this order preservation logic on their XMLDecoder instance. At least that's my current understanding, but maybe I'm missing something and I'd gladly consider some alternative solutions.

MartinP7r commented 4 years ago

Thank you so much for the detail explanation!
I think I understand now. Depending on the requirements of the model, essential context would get lost without keeping the order of different types of elements within the same level of the xml structure.

I've gone through the documenting comments in JMdict's header and so far I don't think there's any such case.
So, I'd really like to try and decode it without the logic for preserving element order to see if it has a significant effect on performance.

I'd try it myself locally in a fork, but I'm not sure about a lot of your implementation, yet.

If I understand it correctly, in XMLCoderElement.transformToBoxTree() the hot path goes through KeyedStorage.merge(element:) and KeyedStorage.append(:at:), whereas the later stores the order?

MaxDesiatov commented 4 years ago

Yes, that's correct. The elements themselves are stored in a buffer that fully preserves the order, but there's also a dictionary that maps from keys to indices in the buffer array. When elements are queried by key, you get them through that dictionary, as far as I remember.