BaseXdb / basex

BaseX Main Repository.
http://basex.org
BSD 3-Clause "New" or "Revised" License

Large file support #1402

Closed · txjmb closed this 7 years ago

txjmb commented 7 years ago

If a source XML file is larger than 2,147,483,647 bytes, BaseX is unable to create a database: it hits the maximum Java array size when reading the file for parsing and throws this error:

java.lang.OutOfMemoryError: Required array size too large
    at java.nio.file.Files.readAllBytes(Unknown Source)
    at org.basex.io.IOFile.read(IOFile.java:105)
    at org.basex.build.DirParser.parseResource(DirParser.java:211)
    at org.basex.build.DirParser.parse(DirParser.java:143)
    at org.basex.build.DirParser.parse(DirParser.java:104)
    at org.basex.build.DirParser.parse(DirParser.java:104)
    at org.basex.build.DirParser.parse(DirParser.java:93)
    at org.basex.build.Builder.parse(Builder.java:77)
    at org.basex.build.DiskBuilder.build(DiskBuilder.java:77)
    at org.basex.core.cmd.CreateDB.run(CreateDB.java:101)
    at org.basex.core.Command.run(Command.java:255)
    at org.basex.core.Command.execute(Command.java:93)
    at org.basex.api.client.LocalSession.execute(LocalSession.java:132)
    at org.basex.api.client.Session.execute(Session.java:36)
    at org.basex.core.CLI.execute(CLI.java:104)
    at org.basex.core.CLI.execute(CLI.java:88)
    at org.basex.BaseX.console(BaseX.java:187)
    at org.basex.BaseX.<init>(BaseX.java:162)
    at org.basex.BaseX.main(BaseX.java:42)
org.basex.core.BaseXException: Out of Main Memory.
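The limit can be reproduced outside BaseX. Here is a minimal Java sketch (the file name huge.xml is hypothetical) of why Files.readAllBytes cannot return the contents of such a file:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Minimal sketch of the limit behind the error: Files.readAllBytes
// returns a single byte[], and a Java array can hold at most
// Integer.MAX_VALUE (2,147,483,647) elements, so any larger file
// triggers "OutOfMemoryError: Required array size too large".
public class ArrayLimitDemo {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("huge.xml"); // hypothetical file > 2 GiB
        long size = Files.size(file);
        if (size > Integer.MAX_VALUE) {
            System.err.println(size + " bytes cannot fit into one byte[]");
        } else {
            byte[] content = Files.readAllBytes(file); // fine for smaller files
            System.out.println("Read " + content.length + " bytes");
        }
    }
}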

For repro, I was trying to load the XML files in this archive:

ftp://eqrdownload.ferc.gov/DownloadRepositoryProd/Bulk/XML/XML_2014_Q3.zip

ChristianGruen commented 7 years ago

Generally, it’s no problem to create databases from much larger XML documents (see http://docs.basex.org/wiki/Statistics for some examples). Could you please provide us with some more details on how you created the database?

ChristianGruen commented 7 years ago

Closed (feel free to reopen this if you have some more details).

txjmb commented 7 years ago

Christian,

Thanks for your reply. I was creating the database with the following commands, run against the unzipped version of the file above (which also contains nested zip files that I pre-decompressed).

SET ADDARCHIVES false
SET SKIPCORRUPT true
SET INTPARSE true
SET STRIPNS true
SET TEXTINDEX false
SET ATTRINDEX false
SET TOKENINDEX false
CREATE DB EQR_XML_2014_Q3 D:/Data/EQRSource/EQR_XML_2014_Q3
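A script like this can be stored in a command-script file (the name create-eqr.bxs here is just an example) and passed to the BaseX standalone client, which evaluates .bxs files as command scripts:

basex create-eqr.bxs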

I am running the BaseX command line on a 64-bit Windows 10 machine with Java 8. I debugged the error, and it does appear to stem from the file contents being loaded into a single Java array all at once; since Java arrays are indexed by int, they can hold at most Integer.MAX_VALUE elements.

Thank you for looking into this.

ChristianGruen commented 7 years ago

Thanks for the script, and sorry for keeping you waiting.

The SKIPCORRUPT option is the culprit: as the documentation states (and as you already indicated), documents are first cached in main memory before they are checked. This usually speeds up the process, and it is required in particular if the input is a stream.
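To make the trade-off concrete, here is a simplified sketch of the two code paths, assuming a standard SAX parser for the well-formedness check; it is a hypothetical illustration, not the actual DirParser code:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration of the SKIPCORRUPT trade-off (not BaseX code).
public class SkipCorruptSketch {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);
        boolean skipCorrupt = true; // corresponds to SET SKIPCORRUPT true

        if (skipCorrupt) {
            // Caching path: the whole file is read into one byte[] so it can
            // be checked before being added. This is what hits the array
            // size limit for files larger than 2,147,483,647 bytes.
            byte[] cached = Files.readAllBytes(file);
            if (!wellFormed(new ByteArrayInputStream(cached))) {
                System.err.println("Skipping corrupt file: " + file);
                return;
            }
            // ...add the cached document to the database...
        } else {
            // Streaming path: no full copy in memory; a corrupt document
            // aborts the build instead of being skipped.
            try (InputStream in = Files.newInputStream(file)) {
                SAXParserFactory.newInstance().newSAXParser()
                    .parse(in, new DefaultHandler());
            }
            // ...add the streamed document to the database...
        }
    }

    // Returns true if the input parses as well-formed XML.
    private static boolean wellFormed(InputStream in) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(in, new DefaultHandler());
            return true;
        } catch (Exception ex) {
            return false;
        }
    }
}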

Could you please check whether you can create the database without this option?
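For reference, that would be the script from above with only the SKIPCORRUPT line changed:

SET ADDARCHIVES false
SET SKIPCORRUPT false
SET INTPARSE true
SET STRIPNS true
SET TEXTINDEX false
SET ATTRINDEX false
SET TOKENINDEX false
CREATE DB EQR_XML_2014_Q3 D:/Data/EQRSource/EQR_XML_2014_Q3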

Marceko commented 7 years ago

Hey Christian, I've got the same error that txjmb already described: I can't complete my import with an XML file larger than 2 GB. Changing the SKIPCORRUPT option to false has no effect for me. What else can I try? Is it possible not to load the whole file at once? A SAX parser should not run into such a problem, should it? Splitting my XML file is not really an option. Thanks in advance.

ChristianGruen commented 7 years ago

I’m wondering if it’s really the same problem, because files won’t be cached if SKIPCORRUPT is set to false (the code in question: https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/build/DirParser.java#L211-L226). As shown in our Statistics Wiki article, it’s possible to parse files up to 500 GB. Could you give us more hints on what you did, and how to reproduce the error you encountered?

Marceko commented 7 years ago

Hey, thanks for the response. I analyzed my XML file, and there does seem to be a problem with it, probably caused by the file transfer. For the moment I will skip this file, because all the other files (which are much smaller) are fine.