Problems following the readme example

MardahlM commented 3 years ago

The pipeline seems to be working, but I don't get the same output.

I used the sMETASeq_NEXTFlex.sh for D12_S12_R1.fastq file only (due to limited space on my laptop). My commands look like this (I am on a mac, hence gsed instead of sed):

cutadapt -a TGGAATTCTCGGGTGCCAAGG -u 4 -m 1 -o D12_S12_R1.cut.fastq D12_S12_R1.fastq.gz

cutadapt -u -4 -m 1 -o D12_S12_R1.cut_trimmed.fastq D12_S12_R1.cut.fastq

bowtie2 -p20 -k10 -x GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index -q D12_S12_R1.cut_trimmed.fastq --un D12_S12_R1.Unmapped.cut.fastq -S D12_S12_R1.mapped.cut.sam

htseq-count -i Name -t transcript D12_S12_R1.mapped.cut.sam homo_sapiens.GRCh38.chr.gff3 > D12_S12_R1.RNACentral.csv

htseq-count -i Name -t miRNA D12_S12_R1.mapped.cut.sam hsa.gff3 > D12_S12_R1.miRBase.csv

gsed -n '1~4s/^@/>/p;2~4p' D12_S12_R1.Unmapped.cut.fastq > D12_S12_R1.Unmapped.cut.fasta

kraken2 --db ./kraken2_DB --threads 16 --report D12_S12_R1.kraken_report.csv --use-mpa-style --output D12_S12_R1.kraken_mpa.csv D12_S12_R1.Unmapped.cut.fasta

I don't get the same output as in the exampleOutputs provided. Here are tails of *RNAcentral.csv outputs for example followed by the tail of the RNAcentral output for the example dataset.

$ tail D12_S12_R1.RNACentral.csv URS000197B90D_9606 0 URS000197B963_9606 0 URS000197BB5F_9606 0 URS000197BDD2_9606 0 URS000197C5AD_9606 0 no_feature 543420 ambiguous 4948781 too_low_aQual 8065137 __not_aligned 721111 alignment_not_unique 0

exampleOutput $ tail D12_S12_R1.RNACentral.csv URS000197B90D_9606 0 URS000197B963_9606 0 URS000197BB5F_9606 0 URS000197BDD2_9606 0 URS000197C5AD_9606 0 no_feature 1321497 ambiguous 0 too_low_aQual 0 __not_aligned 317399 alignment_not_unique 0

My output has a lot of ambigous and too-low-aQual counts, what could this mean? Is there a flaw in the cutadapt or the alignment? I followed the descriptions in the readme. In the manuscript a fastx_collapser is mentioned so I tried running that, but that seems to make the counts smaller after htseq.

Any suggestions on how I can get the pipeline to work so I can process my +50 samples:-) ??

Cheers

MardahlM commented 3 years ago

upon including the parameters "-a 0 -s yes" to htseq-count I get the following tail on the RNAcentral.csv output:

$ tail D12_S12_R1.RNACentral.csv URS000197B90D_9606 1 URS000197B963_9606 0 URS000197BB5F_9606 0 URS000197BDD2_9606 0 URS000197C5AD_9606 0 no_feature 4070917 ambiguous 8958038 too_low_aQual 0 __not_aligned 721111 alignment_not_unique 0

to be clear this is from running the code:

for FILE in ${files[@]}; do

cutadapt  -a TGGAATTCTCGGGTGCCAAGG -u 4 -m 1 -o $FILE.cut.fastq $FILE.fastq.gz
cutadapt -u -4 -m 1 -o $FILE.cut_trimmed.fastq $FILE.cut.fastq
bowtie2 -p10 -k10 -x GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index -q $FILE.cut_trimmed.fastq --un $FILE.Unmapped.cut.fastq -S $FILE.mapped.cut.sam
htseq-count -a 0 -s yes -i Name -t transcript $FILE.mapped.cut.sam homo_sapiens.GRCh38.chr.gff3 > $FILE.RNACentral.csv
htseq-count -a 0 -s yes -i Name -t miRNA $FILE.mapped.cut.sam hsa.gff3 > $FILE.miRBase.csv
gsed -n '1~4s/^@/>/p;2~4p' $FILE.Unmapped.cut.fastq > $FILE.Unmapped.cut.fasta
kraken2 --db ./kraken2_DB --threads 16   --report $FILE.kraken_report.csv --use-mpa-style --output $FILE.kraken_mpa.csv $FILE.Unmapped.cut.fasta;

done

how am I getting so many ambiguous reads when you have none in the exampleOutput?

cheers

MjelleLab commented 3 years ago

Hi, are you using the correct species? Have you performed a fastQC before and after cutadapt, to check your sequences? Best,

Robin Mjelle PhD Department of Cancer Research and Molecular Medicine Faculty of Medicine Laboratory Centre, 5th floor. Erling Skjalgssons gt. 1 Norwegian University of Science and Technology NTNU 7030 Trondheim, Norway Cell. phone: +47 40234780

On Tue, May 4, 2021 at 8:50 AM MardahlM @.***> wrote:

upon including the parameters "-a 0 -s yes" to htseq-count I get the following tail on the RNAcentral.csv output:

$ tail D12_S12_R1.RNACentral.csv URS000197B90D_9606 1 URS000197B963_9606 0 URS000197BB5F_9606 0 URS000197BDD2_9606 0 URS000197C5AD_9606 0 no_feature 4070917 ambiguous 8958038 too_low_aQual 0 __not_aligned 721111 alignment_not_unique 0

to be clear this is from running the code:

for FILE in ${files[@]}; do

cutadapt -a TGGAATTCTCGGGTGCCAAGG -u 4 -m 1 -o $FILE.cut.fastq $FILE.fastq.gz cutadapt -u -4 -m 1 -o $FILE.cut_trimmed.fastq $FILE.cut.fastq bowtie2 -p10 -k10 -x GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index -q $FILE.cut_trimmed.fastq --un $FILE.Unmapped.cut.fastq -S $FILE.mapped.cut.sam htseq-count -a 0 -s yes -i Name -t transcript $FILE.mapped.cut.sam homo_sapiens.GRCh38.chr.gff3 > $FILE.RNACentral.csv htseq-count -a 0 -s yes -i Name -t miRNA $FILE.mapped.cut.sam hsa.gff3 > $FILE.miRBase.csv gsed -n '1~4s/^@/>/p;2~4p' $FILE.Unmapped.cut.fastq > $FILE.Unmapped.cut.fasta kraken2 --db ./kraken2_DB --threads 16 --report $FILE.kraken_report.csv --use-mpa-style --output $FILE.kraken_mpa.csv $FILE.Unmapped.cut.fasta;

done

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MjelleLab/sMETASeq/issues/2#issuecomment-831720455, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHLPZ6X6NE7VZPLY6HUFJADTL6KM7ANCNFSM44BQV7VQ .

MardahlM commented 3 years ago

Followed the guidelines on your github exactly. So I used Homo sapiens reference genome as you uploaded it. I didn’t perform fastQC because it wasn’t specified in the script? Should I? I am using your example data D12.... to get the pipeline working.

MardahlM commented 3 years ago

Hi, I checked the D12_S12_R data before and after both cutadapt steps as well as the .sam file. They are not the best quality with duplication levels and overrepresented sequences. However, this may be something to expect with small RNA-seq. Can you upload the fastq and sam files for D12_S12_R so that I can pinpoint where my analysis is going wrong? or are you able to pinpoint where I am going wrong? Best,

MjelleLab commented 3 years ago

I will look at it as soon as possible. The fastq data should be available.

ons. 5. mai 2021, 20:57 skrev MardahlM @.***>:

Hi, I checked the D12_S12_R data before and after both cutadapt steps as well as the .sam file. They are not the best quality with duplication levels and overrepresented sequences. However, this may be something to expect with small RNA-seq. Can you upload the fastq and sam files for D12_S12_R so that I can pinpoint where my analysis is going wrong? or are you able to pinpoint where I am going wrong? Best,

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/MjelleLab/sMETASeq/issues/2#issuecomment-832931996, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHLPZ6WRJTYSHU3JHHDEFFTTMGIKRANCNFSM44BQV7VQ .

MardahlM commented 3 years ago

Thanks a lot. Having read more into expected fastqc results from small RNA-seq the processed data looks perfectly normal for such a library. The raw fastq file is available and the csv output files. I am grateful if you can re-run the analysis and send me the in-between fastqs and then I promise a sweet citation:-D

MjelleLab commented 3 years ago

Could you please send me the 400 first sequences of a fastq file?

ons. 5. mai 2021, 21:35 skrev MardahlM @.***>:

Thanks a lot. Having read more into expected fastqc results from small RNA-seq the processed data looks perfectly normal for such a library. The raw fastq file is available and the csv output files. I am grateful if you can re-run the analysis and send me the in-between fastqs and then I promise a sweet citation:-D

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/MjelleLab/sMETASeq/issues/2#issuecomment-832954728, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHLPZ6RNPJD7ARLAFCKYP3DTMGMYHANCNFSM44BQV7VQ .

MardahlM commented 3 years ago

I just realized I haven't been clear enough. I am using your data that I downloaded (I am trying to verify the pipeline before using my own data). The file is called: D12_S12_R.fastq.gz

MardahlM commented 3 years ago

Here are the first (head) 400 entries (1600 lines from D12_S12_R.fastq.gz file. 400entries1600lines_D12_S12_R.fastq.gz I appreciate any help :-)

MjelleLab / sMETASeq