GoekeLab / sg-nex-data

Nanopore RNA-Seq data from the Singapore Nanopore-Expression Project

Raw data in BLOW5 format #30

Closed hasindu2008 closed 1 year ago

hasindu2008 commented 1 year ago

Hi

This is a very useful dataset, but sadly, because the raw data are stored as tar.gz archives, there is no way to grab the signals for particular read IDs without downloading and extracting all the tar.gz files. Would you be able to host the raw data in BLOW5 format, at least for one dataset to begin with?

jonathangoeke commented 1 year ago

Hi @hasindu2008, I think we could definitely do that for one dataset first. We could even host a short tutorial on how to use and query the BLOW5 format with this one file, if you want to contribute that. See here for examples of short tutorials: https://github.com/GoekeLab/sg-nex-data#data-analysis-tutorials-and-workflows

hasindu2008 commented 1 year ago

Thank you very much for being open to this. Which sample would you recommend? I can do the conversion and share the file with you to host, and then I will experiment with it and determine the best parameters for queries on a dataset hosted on AWS.

jonathangoeke commented 1 year ago

Sounds good! You could use this file here, which is one of the direct RNA-Seq samples: s3://sg-nex-data/data/sequencing_data_ont/fast5/SGNex_K562_directRNA_replicate4_run1/SGNex_K562_directRNA_replicate4_run1.tar.gz

hasindu2008 commented 1 year ago

I have converted that dataset, done some sanity checks (read counts and uniqueness of read IDs), and also basecalled it. How would you prefer that I provide the file for you to upload to AWS S3?
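For readers who want to run similar sanity checks on their own conversion, here is a minimal Python sketch of the two checks mentioned above (read count and read ID uniqueness). It is not the procedure actually used in this thread. The function name `check_read_ids` and the example IDs are hypothetical, and it assumes the read IDs have already been exported to a plain list or text file by whatever tool you use.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def check_read_ids(read_ids: Iterable[str],
                   expected_count: Optional[int] = None) -> Dict[str, object]:
    """Summarise a collection of read IDs (hypothetical helper).

    Reports the total number of IDs, the number of distinct IDs,
    any IDs that occur more than once, and whether the total matches
    an expected read count (for example, the count from the original files).
    """
    counts = Counter(read_ids)
    total = sum(counts.values())
    # IDs seen more than once indicate a problem in the conversion or merge.
    duplicates = sorted(rid for rid, n in counts.items() if n > 1)
    return {
        "total": total,
        "unique": len(counts),
        "duplicates": duplicates,
        "count_matches": expected_count is None or total == expected_count,
    }


if __name__ == "__main__":
    # Made-up IDs for illustration only; real read IDs are UUID-like strings.
    example_ids = ["read-0001", "read-0002", "read-0003", "read-0002"]
    report = check_read_ids(example_ids, expected_count=4)
    print(report)
```

A conversion would pass this check when `duplicates` is empty and `count_matches` is true. For very large datasets, holding every ID in memory may not be practical, so treat this as an illustration of the checks rather than a production tool.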

hasindu2008 commented 1 year ago

@jonathangoeke, just a ping in case you forgot this :)

cying111 commented 1 year ago

Hi @hasindu2008, how big is the converted file in BLOW5 format? If it's not too big, we can provide you with a Dropbox link so that you can share it with us. By the way, could you share your email address as well, so that we can send you the Dropbox link later?

hasindu2008 commented 1 year ago

@cying111 I converted SGNex_K562_directRNA_replicate4_run1/SGNex_K562_directRNA_replicate4_run1, as suggested, and the converted size is around 50 GB (originally 74 GB).

I have temporarily uploaded it to my AWS S3 space. Could you see if you can copy it directly to your S3 bucket? BLOW5 file and index: https://slow5test.s3.amazonaws.com/tmp/blow5/SGNex_K562_directRNA_replicate4_run1/

I could convert the whole dataset and provide links like the one above, if that is convenient.

If you are interested, here are the basecalls for that dataset from a recent Guppy version (6.3.7): https://slow5test.s3.amazonaws.com/tmp/guppy_6.3.7_hac_fastq/SGNex_K562_directRNA_replicate4_run1/ If you think it is useful, I should be able to rebasecall the whole dataset conveniently after converting it to BLOW5.

cying111 commented 1 year ago

Great! Could you send me (chen_ying@gis.a-star.edu.sg) the download paths for the BLOW5 file and index? I think the provided link is not usable.

Could you also send me the download path for the newly basecalled FASTQ file? I will take a look at it and get back to you after that.

jonathangoeke commented 1 year ago

Many thanks @hasindu2008 @cying111! All files are now available as BLOW5 with the latest release v.0.4.0 #33

hasindu2008 commented 1 year ago

Great. Thanks for supporting this.