BaseXdb / basex

BaseX Main Repository.
http://basex.org
BSD 3-Clause "New" or "Revised" License
C client Bug: ArrayIndexOutOfBoundsException when the file is larger than 2GB #2185

Closed by MarcinBasiak 1 year ago

MarcinBasiak commented 1 year ago

Description of the Problem

The issue is 100% reproducible. I am trying to create a new database using the BaseX C API from a file that is larger than 2 GB. If the file is smaller than 2 GB, the issue does not occur. https://github.com/BaseXdb/basex/tree/9/basex-api/src/main/c

    xyz@cluster:~$ basexserver
    [warning] /usr/bin/basexserver: Unable to locate /usr/share/java/tagsoup.jar in /usr/share/java
    [warning] /usr/bin/basexserver: Unable to locate /usr/share/java/xml-resolver.jar in /usr/share/java
    [warning] /usr/bin/basexserver: Unable to locate /usr/share/java/jing.jar in /usr/share/java
    BaseX 9.0.1 [Server]
    Server was started (port: 1984).
    Exception in thread "Thread-4" java.lang.ArrayIndexOutOfBoundsException: Maximum array size reached.
        at org.basex.util.Array.newSize(Array.java:238)
        at org.basex.util.list.ElementList.newSize(ElementList.java:27)
        at org.basex.util.list.ByteList.add(ByteList.java:42)
        at org.basex.io.in.BufferInput.readBytes(BufferInput.java:137)
        at org.basex.server.ClientListener.run(ClientListener.java:105)

basex/basex-core/src/main/java/org/basex/util/Array.java

    /** Maximum array size (see {@code MAX_ARRAY_SIZE} variable in {@link ArrayList}). */
    public static final int MAX_SIZE = Integer.MAX_VALUE - 8;

The issue does not occur when I use the CREATE DB command (https://docs.basex.org/wiki/Commands#CREATE_DB): CREATE DB [name] ([input]), where input is the filename.

Questions: Is this expected behavior? Does it mean that BaseX does not support buffers larger than 2 GB? How can I solve this issue? Should I modify the Java code?

Expected Behavior

The C API should behave the same way as the command-line interface. I don't want to save the received XML files to disk first, because that is not optimal.

Steps to Reproduce the Behavior

  1. Create an XML file larger than 2 GB.
  2. Run basexserver.
  3. Build and run the code.
  4. Expected: the server does not throw the exception ArrayIndexOutOfBoundsException: Maximum array size reached.
  5. Expected: the DB is created successfully.

Do you have an idea how to solve the issue?

    public final class Array {
      /** Maximum array size (see {@code MAX_ARRAY_SIZE} variable in {@link ArrayList}). */
      public static final long MAX_SIZE = 3826003525L;

What is your configuration?

    Distributor ID: Ubuntu
    Description: Ubuntu 20.04.5 LTS
    Release: 20.04
    Codename: focal

    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    Address sizes: 46 bits physical, 48 bits virtual
    CPU(s): 32
    On-line CPU(s) list: 0-31
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s): 32
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 85
    Model name: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
    Stepping: 4
    CPU MHz: 2095.076
    BogoMIPS: 4190.15
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 1 MiB
    L1i cache: 1 MiB
    L2 cache: 128 MiB
    L3 cache: 512 MiB
    NUMA node0 CPU(s): 0-31

ChristianGruen commented 1 year ago

Thanks for the observation.

In Java, both strings and arrays, which are used to cache the client request, are limited to fewer than 2^31 entries. It would in principle be possible to use a list of arrays for caching larger requests, but that would be a substantial change to the client architecture, and it would easily trigger OutOfMemory exceptions somewhere later.
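To illustrate why widening a constant such as MAX_SIZE to `long` cannot lift this cap: Java array lengths are `int` values, so a single `byte[]` can never hold a 3 GB request. The sketch below is a minimal illustration under stated assumptions. The class name and the 1.5x growth rule are hypothetical and are not the BaseX implementation; only the `Integer.MAX_VALUE - 8` bound is taken from the issue.

```java
final class ArrayLimitSketch {
  /** Same bound the issue quotes from Array.java. */
  static final int MAX_SIZE = Integer.MAX_VALUE - 8;

  /**
   * Hypothetical growth rule (not the BaseX implementation): grow by 1.5x,
   * clamp to MAX_SIZE, and fail once the list is already at the limit.
   */
  static int newSize(final int size) {
    if (size >= MAX_SIZE) {
      throw new ArrayIndexOutOfBoundsException("Maximum array size reached.");
    }
    // Compute in long so that the growth step cannot overflow int.
    final long grown = Math.max((long) size + 1, (long) (size * 1.5));
    return (int) Math.min(MAX_SIZE, grown);
  }

  public static void main(final String[] args) {
    System.out.println(newSize(1 << 30));      // grows normally
    System.out.println(newSize(MAX_SIZE - 1)); // clamped to MAX_SIZE
    final long threeGb = 3L * 1024 * 1024 * 1024;
    System.out.println(threeGb > MAX_SIZE);    // payload exceeds any single byte[]
    // "new byte[threeGb]" would not compile: array lengths must be int.
  }
}
```

Whatever growth rule is used, a request that is cached in one array fails as soon as it outgrows the bound, which is why streaming the input is the more realistic fix.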

One way out could be to add functions for streaming input, as implemented, for example, in BaseXClient.java and in other client implementations.

ChristianGruen commented 1 year ago

@MarcinBasiak Many of our client bindings have been written, or are being maintained, by external contributors. Would you feel comfortable adding streaming support to the client?

MarcinBasiak commented 1 year ago

@ChristianGruen Thanks for the quick response. I have to confirm it with my manager, but we will be able to contribute and deliver this functionality if we decide to use BaseX in our product. I have three questions about BaseX: Is it currently possible to upload one file to the server in several buffers? That is, upload a 1 GB buffer, then another 1 GB in the next iteration, and so on, with all buffers merged into one file on the server side?

I am also having trouble finding information about database replication: Is it possible to configure BaseX with one master and many slave DBs? Is it possible to configure BaseX in a geo-redundant active/standby mode?

We are looking for a new DB for a big commercial project. We can provide contributors from our company to deliver such functionality. To make that possible, I need the information above. I would be grateful for your response.

ChristianGruen commented 1 year ago

> @ChristianGruen Thanks for the quick response. I have to confirm it with my manager, but we will be able to contribute and deliver this functionality if we decide to use BaseX in our product.

Nice to hear.

> I have three questions about BaseX: Is it currently possible to upload one file to the server in several buffers?

Some (basic) information on the server protocol is given in our documentation: The create(), add(), put() and putbinary() functions have been designed to send a byte stream to the server that may exceed 2 GB. The stream of bytes is not cached, but directly processed by the corresponding server operation. You could have a look at one of the other client implementations, or the ClientListener class that communicates with the client.
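A minimal sketch of what such a streaming send could look like on the client side follows. It assumes the framing used by the Java reference client (BaseXClient.java), as far as I can tell: bytes 0x00 and 0xFF in the input are prefixed with 0xFF, and a single 0x00 byte ends the stream. Please verify this against the server protocol documentation before relying on it. The class and method names are hypothetical, and the command code and database name that precede the stream are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class StreamingSendSketch {
  /**
   * Copies input to output in fixed-size chunks, so memory use does not
   * depend on the input size. Assumed framing (verify against the protocol
   * docs): 0x00 and 0xFF are prefixed with 0xFF, and one 0x00 ends the stream.
   * Returns the number of payload bytes read.
   */
  static long sendEscaped(final InputStream in, final OutputStream out) throws IOException {
    final byte[] chunk = new byte[4096];
    // Worst case: every byte needs a prefix, so the output chunk is twice as large.
    final byte[] escaped = new byte[chunk.length * 2];
    long total = 0;
    for (int n; (n = in.read(chunk)) != -1;) {
      int e = 0;
      for (int i = 0; i < n; i++) {
        final byte b = chunk[i];
        if (b == 0x00 || b == (byte) 0xFF) escaped[e++] = (byte) 0xFF;
        escaped[e++] = b;
      }
      out.write(escaped, 0, e);
      total += n;
    }
    out.write(0); // end-of-stream marker
    out.flush();
    return total;
  }

  public static void main(final String[] args) throws IOException {
    // In-memory streams stand in for the file and the server socket.
    final byte[] payload = { '<', 'a', '/', '>', 0x00, (byte) 0xFF };
    final ByteArrayOutputStream wire = new ByteArrayOutputStream();
    final long sent = sendEscaped(new ByteArrayInputStream(payload), wire);
    System.out.println(sent + " payload bytes, " + wire.size() + " bytes on the wire");
  }
}
```

Because the byte count is tracked in a `long` and only one small chunk is held at a time, the same loop works for inputs well beyond 2 GB. A C client could follow the same pattern with `fread` and `send`.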

> I am also having trouble finding information about database replication: Is it possible to configure BaseX with one master and many slave DBs? Is it possible to configure BaseX in a geo-redundant active/standby mode?

We have implemented custom replication solutions for our clients, but they are not part of the BaseX open-source product.

ChristianGruen commented 1 year ago

@MarcinBasiak Do you have more questions on the current state of the client?

MarcinBasiak commented 1 year ago

Dear Christian, the suggested solution works. We can load large files into BaseX with the put method. So I don't think this is a BaseX issue, and we can close it. Thanks for your help.

MarcinBasiak commented 1 year ago

The described issue does not occur when we use the BaseX put method and set the INTPARSE option to true: https://docs.basex.org/wiki/Options#INTPARSE