eth-mds / ricu

🏥 ICU data with R 🏥
https://eth-mds.github.io/ricu/
GNU General Public License v3.0

exceed 2^31-1 bytes #64

Open partizanos opened 2 months ago

partizanos commented 2 months ago

Hello, I am trying to use ricu with the sic dataset, but I run into the issue below. Any ideas?

sic$laboratory
Data for `sic` is missing
Setup now (Y/n)? Y
The requested tables have already been downloaded
── Importing 8 tables for `sic` ───────────────────────────────────────────────────
Error in paste(do.call("c", msg), collapse = "\n") : 
  result would exceed 2^31-1 bytes
In addition: There were 50 or more warnings (use warnings() to see the first 50)

mcr1213 commented 2 months ago

I also have this issue, with no solution yet. It seems to be specific to running import_src on the 2.15 GB data_float_h.csv.gz file from SICdb; all other datasets worked fine.

Some things I've tried:

Full traceback is included:

Screenshot 2024-04-21 at 18 40 16

Any other suggestions to try would be much appreciated.

manuelburger commented 2 months ago

The configuration for the SICdb database (tag sic) in inst/extdata/config/data-sources.json does not correctly reflect the most recent version of the data, which is the one downloaded from PhysioNet. A mostly correct configuration can be found in an earlier PR that integrated the database, https://github.com/eth-mds/ricu/pull/30, but it seems not to have been merged entirely into the current main branch.

The posted error stems from the fact that ricu, or more specifically the read_csv_chunked function, raises a warning for every single erroneous line when importing the csv. The most problematic configuration is the one for the data_float_h table: in the current main branch (https://github.com/eth-mds/ricu/blob/7f2cc42503e003f4aea388847232e4157b7fc8ea/inst/extdata/config/data-sources.json#L9740) the rawdata column is specified as type col_double. The database documentation (https://www.sicdb.com/Documentation/Signal_Data) clearly states that this column is a binary data column that compresses up to 60 floats into a single cell of the csv table, which keeps the row count manageable while still providing up to minute-level resolution for some variables. 60 compressed floats naturally do not cast to a col_double, so every single line of the data_float_h table produces a full error message. ricu's report_problems function then concatenates all of these messages, and concatenating that many blows past R's string size limit of 2^31-1 bytes, which explains the error.
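A rough back-of-the-envelope check of this explanation, as a sketch: since a single R string is capped at 2^31-1 bytes, the number of per-line messages that fit depends only on their average length. The 120-byte average used below is an assumed figure for illustration, not a measurement from ricu or readr, and max_messages is a hypothetical helper name.

```python
# Upper limit on the size of a single R string, in bytes (2^31 - 1).
R_STRING_LIMIT = 2**31 - 1


def max_messages(avg_message_bytes: int, limit: int = R_STRING_LIMIT) -> int:
    """Largest number of newline-joined messages that still fits in `limit` bytes."""
    # n messages joined by "\n" need n * avg + (n - 1) bytes in total,
    # so the condition is n * (avg + 1) - 1 <= limit.
    return (limit + 1) // (avg_message_bytes + 1)


if __name__ == "__main__":
    # Assumed average message length; the real value was not reported in the thread.
    print(max_messages(120))
```

Under that assumption the limit is reached after a few tens of millions of lines, so a table in which every row fails to parse can plausibly exceed it.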

Interestingly, there is a second report_problems function defined just above that one, which would handle this problem by reporting only the first 10 issues and ignoring the rest. However, since it is listed first in the source code, the later definition overrides it and is the one ultimately used, so all messages are propagated at the moment.
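The idea of reporting only the first few problems can be sketched as follows. This is an illustration in Python, not ricu's R implementation; report_problems_capped and its max_shown parameter are hypothetical names.

```python
def report_problems_capped(problems, max_shown=10):
    """Build one report string from at most `max_shown` problem messages."""
    problems = list(problems)
    shown = problems[:max_shown]

    # One summary line, then the first few messages only.
    lines = [f"{len(problems)} parsing problem(s) found"]
    lines.extend(f"  {msg}" for msg in shown)

    # Mention how many messages were left out instead of concatenating them.
    hidden = len(problems) - len(shown)
    if hidden > 0:
        lines.append(f"  ... and {hidden} more not shown")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report_problems_capped(f"row {i}: expected a double" for i in range(25)))
```

The size of the resulting string is bounded by max_shown rather than by the number of failing rows, which is what keeps it below the string size limit.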

Potential fix is:

Hope this helps

mcr1213 commented 2 months ago

@manuelburger Thank you so much for the clear explanation. I've removed the redundant 'report_problems' function and changed the rawdata column from col_double to col_character in the config file. However, after 31% another error occurs:

Screenshot 2024-04-27 at 19 39 08

This probably has to do with the changes you mentioned in #30, which are not merged into the main branch. Is there any particular reason that these changes are not available? Or am I the only one for whom SICdb 1.0.6 is not working in ricu?

mcr1213 commented 2 months ago

So short update, I've taken the branch mentioned in #30 as created by @prockenschaub and recompiled the ricu package (the older 0.5.5 version that is) and tried with this to add SICdb. The previous error does not occur, however after importing 86% a new one does:

Screenshot 2024-04-29 at 19 21 41

I've tried tracing back the code to see if there was an obvious explanation, but could not find one. It is not clear to me what function res should be.

Is there anyone with a working SICdb environment? And could they tell me which codebase they used?

prockenschaub commented 2 months ago

@mcr1213 I originally meant to work with SICdb when it was released but this has been pushed back repeatedly, so I haven't touched the code in a while. I originally thought that SICdb was fully integrated in ricu 0.6.0 and there was no need for my code, but apparently not.

Since there appears to be increased interest in SICdb, maybe now is a good time to look at it again. I will try to find some time in the coming days to look at your error and see what's wrong / how we can bring the code into the latest version of ricu and SICdb.

Edit: I had a quick look. res should be the function sic_data_float_h, as defined in data-sources.json:

"callback": "sic_data_float_h"

mcr1213 commented 1 month ago

@prockenschaub Thanks for your suggestion. Unfortunately, I'm no expert in debugging R packages and it does not work for me yet. At the moment my hypothesis is that the mentioned 'sic_data_float_h' cannot be found: when running ls("package:ricu"), this function does not show up among the available functions. I do know that it is defined in the file "./R/callback-tb-R", which is new compared to the original release. Searches on Google/ChatGPT suggested listing the file in the main DESCRIPTION file, but the other files are not referenced there either.
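The kind of lookup that seems to be failing can be pictured with a small sketch: a callback is named as a string in the configuration and has to be resolved to an actual function at import time. This is an illustration in Python only, not how ricu resolves callbacks internally; CALLBACKS, resolve_callback and the placeholder function body are all hypothetical.

```python
import json


def sic_data_float_h(rows):
    """Placeholder standing in for the real callback, which lives in the ricu sources."""
    return rows


# Hypothetical registry standing in for the set of functions the package can see.
CALLBACKS = {"sic_data_float_h": sic_data_float_h}


def resolve_callback(table_cfg, registry=CALLBACKS):
    """Return the function named under "callback", or None if no callback is configured."""
    name = table_cfg.get("callback")
    if name is None:
        return None
    try:
        return registry[name]
    except KeyError:
        # Fail early with a clear message instead of a confusing error later on.
        raise LookupError(f"callback '{name}' is configured but not defined") from None


if __name__ == "__main__":
    cfg = json.loads('{"callback": "sic_data_float_h"}')
    print(resolve_callback(cfg).__name__)
```

If the named function is missing from wherever the lookup happens, the import can only fail once that table is reached, which would fit an error that appears partway through the import.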

I've also tried to 'roxygenize' the package to regenerate NAMESPACE, but no luck.

Can you tell me if I'm on the right track? Does the sicdb work for you?

dplecko commented 1 month ago

I will resolve this issue in the next version (i.e., in June). In the meantime, if this is an urgent matter for anyone, my suggestion is to simply perform manual conversion to fst. I am attaching below some (pretty raw) code that I used for converting the sic tables when I first accessed the data. This code could perhaps be helpful for anyone looking for a quick fix, until I resolve the issue properly.

First, I split the data_float_h table into chunks (since it is huge)

import csv, os

def split_csv_file(input_file, output_prefix, num_files):
    # Open the input CSV file
    with open(input_file, 'r') as file:
        # Create a CSV reader
        reader = csv.reader(file)

        # Read the header row
        header = next(reader)

        # Calculate the number of rows per file (excluding the header row)
        rows_per_file = (sum(1 for _ in reader) + num_files - 1) // num_files

        # Reset the file pointer to the beginning and skip the header again
        file.seek(0)
        reader = csv.reader(file)
        next(reader)

        # Split the CSV into smaller chunks, streaming one row at a time
        chunk_index = 0
        output = None
        for i, row in enumerate(reader):
            if (i % rows_per_file) == 0:
                # Close the previous chunk (if any) and open a new output file
                if output is not None:
                    output.close()
                    print(f"Saved {output_file}")
                chunk_index += 1
                output_file = f"{output_prefix}_{chunk_index}.csv"
                output = open(output_file, 'w', newline='')
                writer = csv.writer(output)
                writer.writerow(header)  # Write the header row

            # Every data row goes to the currently open chunk
            writer.writerow(row)

        # Close the last chunk
        if output is not None:
            output.close()
            print(f"Saved {output_file}")

input_path = os.path.expanduser("sic-data/data_float_h.csv")
split_csv_file(input_path, "output", 30)
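Because an incomplete split would silently drop measurements, it may be worth confirming that the chunks together hold as many data rows as the input file. A minimal sketch, assuming the chunk files are named <prefix>_<n>.csv as in the script above; count_data_rows and chunks_cover_input are hypothetical helper names.

```python
import csv
import glob


def count_data_rows(path):
    """Count the rows of a CSV file, excluding the header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header if there is one
        return sum(1 for _ in reader)


def chunks_cover_input(input_file, output_prefix):
    """Return (row counts match, number of chunk files found)."""
    # glob.escape protects special characters that may occur in the prefix.
    chunk_files = glob.glob(f"{glob.escape(output_prefix)}_*.csv")
    total = sum(count_data_rows(p) for p in chunk_files)
    return total == count_data_rows(input_file), len(chunk_files)
```

For the call above this would be chunks_cover_input(input_path, "output"). The check reads every file once more, so it takes a while on a table of this size.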

And then all tables can be converted to fst


root <- rprojroot::find_root(".gitignore")
r_dir <- file.path(root, "r")
invisible(lapply(list.files(r_dir, full.names = TRUE), source))

library(fst)
library(ricu)

if (!dir.exists(file.path(data_dir(), "sic"))) 
  dir.create(file.path(data_dir(), "sic"))

convert_names <- c(
  "cases", "d_references", "data_range", "data_ref", "laboratory",
  "medication", "unitlog",
  "data_float_h"
)

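data_path <- file.path("~", "Desktop", "sic-data")
# This branch only runs when "data_float_hfull" is listed in convert_names.
# It then replaces convert_names with the chunk csv files found in the
# data_float_h subfolder of data_path, so only the chunks are converted.
# Chunks named data_float_h_<n>.csv end up as data_float_h/<n>.fst below.
if (is.element("data_float_hfull", convert_names)) {

  convert_names <- paste0(
    "data_float_h/",
    gsub(".csv", "", list.files(file.path(data_path, "data_float_h")))
  )
}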

for (tab_name in convert_names) {

  if (file.exists(file.path(data_path, paste0(tab_name, ".csv")))) {

    tbl <- read.csv(file.path(data_path, paste0(tab_name, ".csv")))
    # file.remove(paste0(tab_name, ".parquet"))

    if (grepl("data_float_h_", tab_name)) 
      tab_name <- gsub("data_float_h_", "", tab_name)

    if (tab_name == "microbiology") {

      off_col <- which(names(tbl) == "offset")
      names(tbl)[off_col] <- "Offset"
    }

    if (tab_name == "gcs") {

      tbl$Offset <- 0
    }

    write_fst(tbl, path = file.path(data_dir(), "sic", paste0(tab_name, ".fst")))

  }

  print(tab_name)
}

# Find the chunks in which rawdata was read as logical (i.e., all NA)
fix_rawdata <- which(
  vapply(
    1:30,
    function(i) {
      class(
        read_fst(file.path(data_dir(), "sic", "data_float_h", 
                           paste0(i, ".fst")))$rawdata
      )
    }, character(1L) 
  ) == "logical"
)

# Convert rawdata to numeric in those chunks, so all chunks share one type
for (i in fix_rawdata) {

  lgl_out <- read_fst(file.path(data_dir(), "sic", "data_float_h", 
                                paste0(i, ".fst")))
  lgl_out$rawdata <- as.numeric(lgl_out$rawdata)

  write_fst(lgl_out, file.path(data_dir(), "sic", "data_float_h", 
                               paste0(i, ".fst")))
}

Once the fst files are properly named and located in a folder called sic within the directory given by ricu::data_dir(), there should be no further issues.

mcr1213 commented 1 month ago

Thanks for the help everyone! The tables can now be successfully imported.

partizanos commented 1 week ago

Happy to see active interest in the issue.

@dplecko I ran the Python and R code snippets and was able to generate data_float_h in parts; however, when the parts are moved into a folder data_float_h inside data_dir(), they are not recognized by ricu. Did you merge the chunked output files into one somehow, or does the folder with the 15 .fst files suffice? Is there a timeline for the fix (I saw an upcoming ricu v0.6.1, but I am not sure whether sic handling will be included)? @mcr1213 I am glad to hear that. Did you go for the other branch? Which solution worked for you?

Thank you in advance for your help and active maintenance of the repository.

mcr1213 commented 1 week ago

@partizanos I'm afraid I had to use some combination of all the solutions provided, and I'm not exactly sure which step was crucial for getting a working sicdb. I used the script above to unpack the data. In the end I had a single data_float_h.fst file that worked.

I guess that multiple .fst files in dir data_float_h should work too, as other datasets use the same structure.

dplecko commented 1 week ago

@partizanos here is how the tables are organized for me in the data_dir() location.

data_float_h_layout

The folder should be called data_float_h and inside you should have files that are called 1.fst, 2.fst, and so on (the exact number of chunks should not matter). If you have this setup, but the loading is not working, I would be quite surprised, and would ask you for further details on what exactly is causing the issue.
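The layout described here can be checked mechanically before involving ricu at all. A minimal sketch, assuming the path of the sic folder inside ricu::data_dir() is passed in by hand; check_chunk_layout is a hypothetical helper and only inspects file names, not file contents.

```python
import os
import re


def check_chunk_layout(sic_dir, table="data_float_h"):
    """Return a list of problems with the chunk folder; empty if none were found."""
    table_dir = os.path.join(sic_dir, table)
    if not os.path.isdir(table_dir):
        return [f"missing folder: {table_dir}"]

    problems = []
    numbers = []
    for name in sorted(os.listdir(table_dir)):
        # Chunk files are expected to be called 1.fst, 2.fst, and so on.
        match = re.fullmatch(r"([1-9]\d*)\.fst", name)
        if match:
            numbers.append(int(match.group(1)))
        else:
            problems.append(f"unexpected file name: {name}")

    if not numbers:
        problems.append("no <n>.fst files found")
    else:
        # Gaps in the numbering usually mean a chunk was not converted.
        missing = sorted(set(range(1, max(numbers) + 1)) - set(numbers))
        if missing:
            problems.append(f"missing chunk numbers: {missing}")
    return problems
```

An empty result only says that the names match the description above. Files left under names such as output_3.fst would be listed as unexpected.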

A proper fix for all of this will happen some time this summer in ricu 0.6.2.