GoekeLab / sg-nex-data

Nanopore RNA-Seq data from the Singapore Nanopore-Expression Project

Two fastq files were not correctly formatted #36

Open alexyfyf opened 1 year ago

alexyfyf commented 1 year ago

Hi team,

I have downloaded some cDNA fastq files from your S3 repo. When I ran QC with NanoPlot, I found that two of the files are not correctly formatted.

SGNex_MCF7_cDNAStranded_replicate2_run1/SGNex_MCF7_cDNAStranded_replicate2_run1.fastq.gz
SGNex_K562_cDNAStranded_replicate3_run3/SGNex_K562_cDNAStranded_replicate3_run3.fastq.gz

The first one has additional strings before the @ character of the first read.

fastq_fail/FAK34234_679ea2e77287c6ea3bab84c69ca16d29e5d9c760_228.fastq000666 001750 001750 00010735421 13424777162 023424 0ustar00gridgrid000000 000000 @0185f0c7-c4a5-40fb-9ac2-6907653a86a5 runid=679ea2e77287c6ea3bab84c69ca16d29e5d9c760 read=46243 ch=61 start_time=2019-02-01T08:06:48Z flow_cell_id=FAK34234 protocol_group_id=010219_MCF7_mRNA_PCS109 sample_id=010219_MCF7_mRNA_PCS109
ACGGTAATACTTCGGTCTTGTTTCGACAATCGGTCGCTCAGACCGACCGTGGAAC
+
#"*%&$#%"$&"""""$&&#"""""""++*++)/+%#%##'+*$%&'%"##("&$

The second one has a read whose quality string does not match the length of its sequence.

@09f55d50-803e-4048-899d-bb2fbdbf9c33 runid=446e90283984afd70d3f9af90262644290c7fca2 read=1796 ch=64 start_time=2019-01-07T07:56:26Z flow_cell_id=FAK11042 protocol_group_id=070119_K562_mRNA_PCS109 sample_id=070119_K562_mRNA_PCS109
TCGGTGATAAAGTGTTAATCGTCGG
+
%"-$&%""""""""$"""""""""

Can you confirm this? Cheers, Alex
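
Both problems described above (extra text before the `@` of a header, and a quality string whose length differs from the sequence) can be detected with a record-level check. The following is a minimal sketch, not the project's QC procedure: the function name `check_fastq` and the `max_problems` parameter are made up for illustration, and it assumes plain four-line FASTQ records with no wrapped sequence lines.

```python
import gzip
from itertools import islice


def check_fastq(path, max_problems=10):
    """Return a list of (record_number, message) for malformed 4-line records.

    Hypothetical helper: assumes each record is exactly four lines
    (header, sequence, '+', quality) and that the file is gzip-compressed
    if its name ends in '.gz'.
    """
    problems = []
    opener = gzip.open if str(path).endswith(".gz") else open
    # Binary mode so that stray non-text bytes cannot raise decoding errors.
    with opener(path, "rb") as handle:
        record = 0
        while len(problems) < max_problems:
            lines = [line.rstrip(b"\n") for line in islice(handle, 4)]
            if not lines:
                break  # clean end of file
            record += 1
            if len(lines) < 4:
                problems.append((record, "truncated record"))
                break
            header, seq, plus, qual = lines
            if not header.startswith(b"@"):
                problems.append((record, "header does not start with '@'"))
            if not plus.startswith(b"+"):
                problems.append((record, "third line does not start with '+'"))
            if len(seq) != len(qual):
                problems.append(
                    (record, f"sequence length {len(seq)} != quality length {len(qual)}")
                )
    return problems
```

Calling `check_fastq("SGNex_K562_cDNAStranded_replicate3_run3.fastq.gz")` on a local copy would return an empty list for a well-formed file, or the first few offending record numbers otherwise.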

cying111 commented 11 months ago

Hi @alexyfyf ,

Thanks for pointing out the problems with those files.

I have corrected those two files and updated them in the S3 bucket. Please have a look.

Please let us know if issues are found for other files as well!

Thank you. Warm regards, Ying

alexyfyf commented 11 months ago

Hi Ying,

I did spot another file, from the dRNA data, that is also corrupted: SGNex_MCF7_directRNA_replicate2_run2

It has quite a few problems, and I used the following code to fix it.

zcat SGNex_MCF7_directRNA_replicate2_run2.fastq.gz | sed 's/.*@/@/g' | sed '$d' | gzip > SGNex_MCF7_directRNA_replicate2_run2_fixed.fastq.gz

You can have a look and see if there's a better way.

Cheers, Alex
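
One caveat with the `sed` approach above: the substitution is applied to every line, and `@` is also a valid quality character (Phred 31 in the usual +33 encoding), so a quality line that happens to contain `@` would be shortened as well. A record-aware alternative might look like the sketch below. It is only an illustration under stated assumptions: `repair_fastq` is a hypothetical name, records are assumed to stay aligned on four-line boundaries, read headers are assumed to contain no second `@`, and records that still fail the basic checks are dropped rather than repaired.

```python
import gzip
from itertools import islice


def repair_fastq(src, dst):
    """Copy 4-line records from src to dst (both gzip), cleaning headers only.

    Hypothetical helper. Returns (kept, dropped) record counts.
    """
    kept = dropped = 0
    with gzip.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        while True:
            lines = [line.rstrip(b"\n") for line in islice(fin, 4)]
            if not lines:
                break
            if len(lines) < 4:
                dropped += 1  # trailing partial record at end of file
                break
            header, seq, plus, qual = lines
            at = header.rfind(b"@")
            if at > 0:
                # Strip whatever precedes the '@' on the header line only;
                # sequence and quality lines are left untouched.
                header = header[at:]
            if at == -1 or not plus.startswith(b"+") or len(seq) != len(qual):
                dropped += 1  # still malformed: skip instead of guessing
                continue
            fout.write(b"\n".join((header, seq, plus, qual)) + b"\n")
            kept += 1
    return kept, dropped
```

The returned counts make it easy to see how many records were discarded, which the one-line pipeline does not report.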

cying111 commented 11 months ago

Hi Alex,

Thanks again for the heads-up and for sharing your code to correct it.

I think that's good already.

I have uploaded the corrected version just now.

Thank you. Regards, Ying