Problems preprocessing COVID-19 Sample from Paper

yeredh commented 4 years ago

Hello,

I downloaded the FASTQ files for sample GSM4339771 (SRR11181956) from SRA in the original format from https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11181956

So I ended up with two files:

C143_R1.fastq.gz.1
C143_R2.fastq.gz.1

I was able to identify the cell barcodes with umi_tools

umi_tools whitelist --stdin C143_R1_test.fastq.gz  \
                    --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                    --set-cell-number=100 \
                    --log2stderr > whitelist.txt;

However, when I tried the next step, extracting the barcodes and UMIs and adding them to the read names,

umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
                  --stdin C143_R1.fastq.gz \
                  --stdout C143_R1_extracted.fastq.gz \
                  --read2-in C143_R2.fastq.gz  \
                  --read2-out=C143_R2_extracted.fastq.gz \
                  --filter-cell-barcode \
                  --whitelist=whitelist.txt;

I get the following error message

ValueError: 
Read pairs do not match
CL200152206L1C001R001_0/1 != CL200152206L1C001R001_0/2

What am I doing wrong?

Best,

Yered

Dragonlongzhilin commented 4 years ago

I guess that the read IDs do not match one-to-one between read 1 and read 2. You should check the FASTQ files.
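One way to run that check is sketched below. It is only a minimal illustration, assuming well-formed four-line FASTQ records; the helper names `read_ids` and `first_mismatch` are hypothetical and not part of UMI-tools.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the read ID (first token of each header line, without '@').

    Assumes plain four-line FASTQ records; gzip is used when the
    file name ends in '.gz'.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            # Header lines are every fourth line, starting at line 0.
            if i % 4 == 0 and line.strip():
                yield line[1:].split()[0]


def first_mismatch(r1_path, r2_path):
    """Return (record index, id1, id2) for the first differing pair, else None.

    A missing record in the shorter file shows up as None for that ID.
    """
    pairs = zip_longest(read_ids(r1_path), read_ids(r2_path))
    for n, (id1, id2) in enumerate(pairs):
        if id1 != id2:
            return n, id1, id2
    return None
```

Calling `first_mismatch("C143_R1.fastq.gz", "C143_R2.fastq.gz")` would report the first record where the two files disagree, which shows whether the files are out of sync or the IDs merely differ in a suffix.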

PierreBSC commented 4 years ago

Hi Yered,

So basically you are doing it completely right and the problem comes from the files. UMI-tools was designed to process FASTQ files produced by Illumina devices. The files you are mentioning were generated by a BGI machine, so the read headers are a bit different: as your error message shows, the read 1 and read 2 IDs end in /1 and /2 and therefore never compare as equal. This is problematic but can be solved. First you need to install a specific version of UMI-tools: https://github.com/CGATOxford/UMI-tools/tree/%7BTS%7D-IgnoreReadPairSuffix. You then need to modify the extract line as described here: https://github.com/CGATOxford/UMI-tools/issues/325 and it should do the job!
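To make the cause concrete, here is a small sketch using the two read IDs from the error message. It only illustrates why a strict comparison fails and why ignoring the pair suffix fixes it; it is not UMI-tools' own code, and `strip_pair_suffix` is a hypothetical helper.

```python
import re

# Matches a trailing "/1" or "/2" read-pair suffix.
_PAIR_SUFFIX = re.compile(r"/[12]$")


def strip_pair_suffix(read_id):
    """Return the read ID without a trailing /1 or /2 (hypothetical helper)."""
    return _PAIR_SUFFIX.sub("", read_id)


# The two IDs reported in the error message above.
r1 = "CL200152206L1C001R001_0/1"
r2 = "CL200152206L1C001R001_0/2"

print(r1 == r2)                                        # False
print(strip_pair_suffix(r1) == strip_pair_suffix(r2))  # True
```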

Hope this will help,

Best

Pierre

yeredh commented 4 years ago

Thank you Pierre for your prompt reply!

PierreBSC / Viral-Track

Problems preprocessing COVID-19 Sample from Paper #9