fsprojects / FSharp.Data

F# Data: Library for Data Access
https://fsprojects.github.io/FSharp.Data

Memory Limitations in XmlProvider #1501

Open ibrahim324 opened 6 months ago

ibrahim324 commented 6 months ago

I tried to parse a dump of some Wikipedia pages with XmlProvider, but no matter what I try, I get a System.OutOfMemoryException. Is there any guidance or a pattern for parsing large files with type providers? The file is almost exactly 2 GB.

My code:

#r "nuget: FSharp.Data"
open FSharp.Data

open System
open System.IO

type Wiki = XmlProvider<"""data/wikidata_sample.xml""">

let xmlFromFile = 
    task{
        let path = "data/wikidata.xml" 
        let! text = File.ReadAllTextAsync(path)

        Wiki.Parse(text).Pages
        |> Array.map (fun f -> f.Revision.Text)
        |> Array.iter (fun f -> printfn $"{f}")
    }

let xmlFromStream = 
    let options = 
        new FileStreamOptions(BufferSize=32)
    use stream = new FileStream("data/wikidata.xml", options)
    stream 
    |> Wiki.Load
    |> fun f -> f.Pages
    |> Array.map (fun f -> f.Revision.Text.Value)
    |> Array.iter (fun f -> printfn $"{f}")

xmlFromStream

// xmlFromFile 
// |> Async.AwaitTask
// |> Async.RunSynchronously

cartermp commented 6 months ago

Can you post the stack trace when this happens?

ibrahim324 commented 6 months ago

This is the complete error message (copied from fsi):

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ToString()
   at FSharp.Data.Runtime.BaseTypes.XmlElement.Create(TextReader reader) in D:\a\FSharp.Data\FSharp.Data\src\FSharp.Data.Xml.Core\XmlRuntime.fs:line 59
   at <StartupCode$FSI_0003>.$FSI_0003.main@() in /Users/halilibrahimozcan/source/projects/fsharp_xml_parsing/script.fsx:line 25
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)
Terminated due to an error

Is there any other way to retrieve information about the error?

cartermp commented 6 months ago

Thanks! In this case it seems the size of the string is too big, as this is failing with the internal StringBuilder used in the XML Reader. Are you running in a 32-bit process?
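
A rough size check may help show why process bitness might not be the deciding factor. This is only a sketch: it assumes a single .NET string is capped at roughly 2^30 UTF-16 code units (the exact runtime limit is slightly lower), and `fitsLengthLimit` is a hypothetical helper, not part of FSharp.Data.

```fsharp
// Assumption: a single .NET string holds at most about 2^30 UTF-16 code units,
// independent of whether the process is 32-bit or 64-bit.
let approxMaxStringChars = 1L <<< 30

// Hypothetical helper. A UTF-8 file never decodes to more UTF-16 code units
// than it has bytes, so the byte count is a safe upper bound on string length.
let fitsLengthLimit (byteCount: int64) =
    byteCount < approxMaxStringChars

// The file in this issue is described as almost exactly 2 GB.
let twoGigabytes = 2L * 1024L * 1024L * 1024L

printfn "%b" (fitsLengthLimit twoGigabytes)
```

If that assumption holds, a mostly ASCII 2 GB file would decode to about twice as many characters as one string can hold, which would be consistent with the failure in StringBuilder.ToString() in the stack trace.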

ibrahim324 commented 6 months ago

@cartermp No, I have not configured fsi in any way. I'm running on macOS, if that makes a difference.

Thorium commented 4 months ago

Does it matter whether the source file encoding is UTF-8 or UTF-16?

ibrahim324 commented 4 months ago

@Thorium Can you point me to where I should set the encoding? I tried the following: let text = File.ReadAllText(path, Encoding.UTF32), which unfortunately didn't work. UTF16 wasn't available either.

Thorium commented 4 months ago

I meant the encoding of the XML file itself: if it's UTF-16, consider converting it to UTF-8 to use less memory. Notepad++, for example, shows the encoding of the open file.

ibrahim324 commented 4 months ago

@Thorium Hi, I just opened it in Notepad++. The file was encoded in UTF-8 to begin with.

dsyme commented 3 months ago

Try using fsiAnyCpu - fsi runs 32-bit by default

ibrahim324 commented 3 months ago

@dsyme That doesn't seem to be the issue. I ran the script within Rider, which is AnyCPU by default, as I checked. I also ran it as a console program, but it still throws an OutOfMemoryException.
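
One pattern that might sidestep the problem is to stream the file and materialize only one element at a time, instead of handing the whole document to the provider. The sketch below uses only System.Xml and System.Xml.Linq, not XmlProvider. The `streamElements` helper and the inline sample are hypothetical, and the element names `page` and `text` are assumed from the property names in the code above.

```fsharp
open System.IO
open System.Xml
open System.Xml.Linq

/// Lazily yields each element with the given local name, one at a time,
/// so the whole document is never held in memory at once.
let streamElements (elementName: string) (input: TextReader) : seq<XElement> =
    seq {
        use reader = XmlReader.Create(input)
        reader.MoveToContent() |> ignore
        while not reader.EOF do
            if reader.NodeType = XmlNodeType.Element && reader.LocalName = elementName then
                // XNode.ReadFrom consumes the element and leaves the reader
                // just past it, so no extra Read() is needed on this branch.
                yield XNode.ReadFrom(reader) :?> XElement
            else
                reader.Read() |> ignore
    }

// Tiny in-memory stand-in for the dump (assumed structure).
let sample =
    "<mediawiki><page><revision><text>first</text></revision></page>"
    + "<page><revision><text>second</text></revision></page></mediawiki>"

let texts =
    use input = new StringReader(sample)
    streamElements "page" input
    |> Seq.map (fun page ->
        page.Descendants()
        |> Seq.filter (fun e -> e.Name.LocalName = "text")
        |> Seq.map (fun e -> e.Value)
        |> String.concat "")
    |> Seq.toList

texts |> List.iter (printfn "%s")
```

For the real file, the StringReader would be replaced by a StreamReader over data/wikidata.xml. If typed access is still wanted, each yielded element could presumably be passed as a string to the Parse method of a provider type inferred from a single-page sample, though that is untested here.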