Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

fread fails with unhelpful error message if data is gzipped and too large to fit into /tmp #5095

Closed: ningwei-wei closed this issue 2 years ago

ningwei-wei commented 3 years ago

Hi, when I load my data I get a message that I don't know how to resolve:

> methydata<-fread("~/TCGA/Methylation450K/jhu-usc.edu_PANCAN_HumanMethylation450.betaValue_whitelisted.tsv.synapse_download_5096262.xena.gz",data.table = F,nrows=396065)
Avoidable 2.493 seconds. This file is very unusual: it ends abruptly without a final newline, and also its size is a multiple of 4096 bytes. Please properly end the last row with a newline using for example 'echo >> file' to avoid this time to copy.

Can you help me?

avimallu commented 3 years ago

As far as I can make out, the file is being read correctly. data.table is simply telling you that the file ends in an unusual way, and that it takes extra time to detect and work around this irregularity. Without the original file, all I can say is that the file should end with a newline, i.e., the non-printing character you get when you press Enter to start a new line.

I doubt there is anything for you to fix here, unless you are getting something unexpected.
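
For illustration, here is a minimal sketch (not from the thread) of how to check the last byte of a file from R and append a missing trailing newline, which is the same fix as 'echo >> file' in a shell; the path is a placeholder and the file is assumed to be non-empty:

con <- file("myfile.tsv", open = "rb")           # placeholder path, opened in binary mode
seek(con, where = -1, origin = "end")            # jump to the last byte
last_byte <- readBin(con, what = "raw", n = 1L)
close(con)
if (last_byte != as.raw(0x0a)) {                 # 0x0a is '\n' (LF)
  cat("\n", file = "myfile.tsv", append = TRUE)  # append the missing newline
}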

jangorecki commented 2 years ago

I think the answer above addresses the question well. If you have more questions about this particular issue, let us know; we can always reopen it if there turns out to be something to improve.

Sabor117 commented 1 year ago

Apologies for reviving this old thread, but I've also encountered this a few times now, and oddly it only seems to occur after compressing a file.

I was able to fread my file perfectly fine beforehand, but after compressing the file with gzip it now always throws this warning.

As you say, I don't think it's necessarily a problem, but it is a bit concerning to see the warning every time I read the file.

ben-schwen commented 1 year ago

@Sabor117 How do you decompress your file? Do you do that yourself, or do you let data.table handle the decompression? Which data.table version are you using? If data.table is handling the decompression, we might be able to fix this if you provide us with a reprex.

Sabor117 commented 1 year ago

I just use fread() like this:

ukbb_scores_1 = fread("st03_01_scores.eqtlgen_ukb_prscs_ukb.tsv", data.table = FALSE)

And this is using data.table_1.14.6. Unfortunately I'm not sure how I can provide anything reproducible for this, as the data file in question is massive (16 GB uncompressed) and not something I can share anyway.

avimallu commented 1 year ago

Without a reproducible example, it's really hard to give suggestions. In addition, the snippet you've shared loads a .tsv file, not a compressed one; we might be able to help if you also provide the code that reads the compressed file.
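
For reference, a direct read of the compressed file would look like the line below (the .gz filename is assumed here; fread needs the R.utils package installed to decompress .gz input):

ukbb_scores_1 = fread("st03_01_scores.eqtlgen_ukb_prscs_ukb.tsv.gz", data.table = FALSE)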

yolololo-huang commented 1 year ago

Sorry for reviving this old thread again! I encounter the same error when reading in a compressed file, and I cannot read it even after decompressing it. [screenshot attached]

BastienFR commented 9 months ago

For completeness, I've hit exactly the same problem with big compressed files. The cause was that the decompression step writes a file to the OS's /tmp folder (Ubuntu, in my case), which was too small to hold all the data, so decompression stopped halfway through. Increasing the size of your /tmp could fix the problem.

For that I used:
sudo mount -o remount,exec,size=40gb /tmp
but it may not be appropriate in your specific case; please check the proper command yourself.
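
As a hedged aside (not from the thread): fread decompresses the .gz into R's temporary directory and exposes a tmpdir argument, so the scratch location can also be redirected to a larger disk without remounting /tmp; the paths below are placeholders:

tempdir()  # where this R session (and hence fread's decompression step) writes scratch files
DT = fread("big_file.tsv.gz", tmpdir = "/big/disk/scratch")  # decompress on the larger disk instead

R's tempdir() itself honours the TMPDIR environment variable when it is set before R starts, which is another way to move the scratch space.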

The problem, I think, is that the error message is not very related to the actual problem at hand. Troubleshooting this bug would be much easier with a more appropriate error message.

ben-schwen commented 9 months ago

@BastienFR Thanks. As usual, it would be nice to post the output produced with options(datatable.verbose=TRUE); that makes finding the right place for an error message a lot easier.
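
For anyone following along, producing that diagnostic output only requires setting the option before the call (the path below is a placeholder); fread also accepts a per-call verbose=TRUE argument:

options(datatable.verbose = TRUE)   # make fread print its internal stages and timings
DT = fread("big_file.tsv.gz")       # placeholder path
options(datatable.verbose = FALSE)  # switch the verbose output off again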

BastienFR commented 9 months ago

@ben-schwen

I've rerun everything to get the requested information; hopefully it's satisfactory. Please note that I work in a very locked-down corporate environment, so I can only provide limited-quality screenshots.

Output with options(datatable.verbose=FALSE) and /tmp=5gb

[screenshot image003: fread output with the abrupt-ending message and a warning about a problem writing to a connection]

Notice the message about the bad file ending (caused by the decompression failing). In hindsight, the warning about a problem writing to the connection could have rung a bell.

Output with options(datatable.verbose=TRUE) and /tmp=5gb

[screenshot image002: verbose fread output; the read stops at 5.000 GB / 40832623 rows]

Note that the top of the output is truncated (it didn't fit on my screen), but it's the same as in the next screenshot. Notice also that it read 5.000 GB with 40832623 rows, while the original file is longer than that (see below). Also, when datatable.verbose=TRUE, we don't see the message about the file-ending problem.

Output with options(datatable.verbose=TRUE) and /tmp=40gb

[screenshot image001: verbose fread output of the successful read]

This worked. Note that the original file is 19.23 GB with 157014112 rows.

jan-glx commented 7 months ago

Given @BastienFR's detailed report, I think this issue should be renamed to "fread fails with unhelpful error message if data is gzipped and too large to fit into /tmp" and reopened. It might also belong in @HenrikBengtsson's https://github.com/HenrikBengtsson/R.utils. See also: https://www.linkedin.com/pulse/trivial-fix-after-3-hours-debugging-kirill-tsyganov