UCSC-LoweLab / tRAX

tRNA Analysis of eXpression
GNU General Public License v3.0

KeyError: 'rRNA' during processsamples.py #22

Open abakirbas opened 1 year ago

abakirbas commented 1 year ago

Hi, I am a plant biologist interested in your tool tRAX. I have managed to build a database for the model plant species Arabidopsis thaliana and then move to the sample processing step. I get most of the plots but at some point the script ends with the following error:

Traceback (most recent call last):
  File "/opt/trax/processsamples.py", line 549, in <module>
    gettraxqc(samplefilename, trnainfo, expinfo, tgirtmode = nofrag)
  File "/opt/trax/processsamples.py", line 241, in gettraxqc
    traxqc.main(samplefile=samplefile,databasename=trnainfo.dbname,experimentname=expinfo.expname,tgirt = tgirtmode, output=expinfo.qaoutputname)
  File "/opt/trax/traxqc.py", line 724, in main
    typeresults = checkreadtypes(samplename, sampleinfo, tgirtmode)
  File "/opt/trax/traxqc.py", line 487, in checkreadtypes
    rrnapercent = {currsample : typecounts.getrrnapercent(currsample) for currsample in samples}
  File "/opt/trax/traxqc.py", line 487, in <dictcomp>
    rrnapercent = {currsample : typecounts.getrrnapercent(currsample) for currsample in samples}
  File "/opt/trax/traxqc.py", line 366, in getrrnapercent
    return self.typecounts[sample]["rRNA"] / (1.*self.gettotal(sample))
KeyError: 'rRNA'

I can see in my non-tRNA annotation file that rRNAs were successfully extracted from Ensembl. But when I look at my -expname_typecounts.pdf file, the only types shown are "other" and "tRNA" (see attached figure). Some other plots are also missing, such as the scatter plots of the comparison data.

P.S. I am running tRAX in an HPC environment as a Singularity image.

[image: typecounts] https://user-images.githubusercontent.com/39315279/197822885-575b3979-5719-419b-b73f-e1d5a9181168.png

andrewdholmes commented 1 year ago

Alright, that's the QC step that failed, and I think it failed because there was no rRNA in your sample, which is one of the things it checks. You should have all the other results, though. I can try to push something to fix the error, but something does look odd with your sequencing data. If you haven't already, I'd run something like FastQC to see if there's anything unusual about your input data.
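
For readers hitting the same traceback: the failing line indexes the per-sample counts with ["rRNA"] directly, so a sample with no reads assigned to that type raises KeyError. The following is only a sketch of one way such a lookup could be made tolerant. The function name rrna_fraction and the plain-dict layout are assumptions made for illustration; the real traxqc.py uses self.typecounts and self.gettotal, whose internals are not shown in this thread.

```python
def rrna_fraction(typecounts, sample):
    """Return the fraction of a sample's reads typed as rRNA.

    typecounts is assumed (hypothetically) to map
    sample name -> {read type -> count}.
    A missing "rRNA" entry is treated as zero reads instead of an error.
    """
    counts = typecounts.get(sample, {})
    total = sum(counts.values())
    if total == 0:
        # No reads at all: avoid dividing by zero.
        return 0.0
    return counts.get("rRNA", 0) / float(total)


# Hypothetical counts mirroring the report: only "tRNA" and "other" present.
example = {"plusFe_rep1": {"tRNA": 900, "other": 100}}
print(rrna_fraction(example, "plusFe_rep1"))
```

A guard like this would only let the QC step finish; it would not explain why no reads were typed as rRNA in the first place.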

abakirbas commented 1 year ago

Thanks, Andrew. I am using already-trimmed fastq.gz files in my analysis and skipping the trimming step. Do you think that might have triggered the problem? Also, I don't think I have all the other results. As I said in my message, I don't have the scatter plots of DEGs, for example. I understand you are the expert, but I think I have rRNAs and other sRNAs in my sequencing data, because when I use another tool to map my reads to genomic features (see attached image), I see rRNAs clearly. Do you have any other ideas why this step is failing?

[image: plusFe_phloem_sRNA_genome_pie_chart] https://user-images.githubusercontent.com/39315279/198611315-4b580ca0-d236-4d72-bf55-91eb8053ed04.png

andrewdholmes commented 1 year ago

Ahh, alright, then it's probably not a problem with the reads. Next I would check the Ensembl GTF file used for non-tRNA genes and make sure that it uses the same chromosome names as the FASTA file for the genome. If the names don't match, it's probably because the raw GTF file downloaded from Ensembl doesn't use the same chromosome names as the GtRNAdb tRNAs. You can try something like:

awk '{print "chr" $0;}' <ensemblgene.gtf | sed 's/chrMT/chrM/g' >newensemblgene.gtf

and that could fix it. Let me know if it does.
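
Before rewriting the GTF, it may help to confirm which names actually differ. This is a minimal sketch that compares the sequence names in a GTF against the headers of a genome FASTA. The helper names and the toy file contents are hypothetical and are not part of tRAX.

```python
import io


def fasta_names(handle):
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_names(handle):
    """Collect the first column of every non-comment, non-blank GTF line."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


# Toy inputs (hypothetical); real use would pass open() file handles instead.
fasta = io.StringIO(">chr1 example\nACGT\n>chrM\nACGT\n")
gtf = io.StringIO(
    "#!genome-build example\n"
    '1\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n'
    'Mt\tensembl\tgene\t1\t50\t.\t+\t.\tgene_id "g2";\n'
)

# Names used in the GTF that have no matching sequence in the FASTA.
missing = sorted(gtf_names(gtf) - fasta_names(fasta))
print(missing)
```

Two caveats about the one-liner itself: the awk step adds the prefix to every line, including any "#" header lines, and sed matching is case-sensitive, so a contig named Mt would become chrMt and would not be rewritten by 's/chrMT/chrM/g'.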

abakirbas commented 1 year ago

Hi Andrew, there was a mistake in the chromosome names, which I fixed. I rebuilt the database and reran processsamples.py, but I got the same error. Do you have any suggestions as to why this might be happening?

Also, do you know why the expname-qa.html file comes out empty?

andrewdholmes commented 1 year ago

Yeah, I think I need the Rlog.txt file to figure that out. It might be that there are no replicates for your samples; DESeq2, which I use for analyzing read counts, requires replicates.
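
A quick way to check the replicate question is to count rows per sample group in the sample file. The sketch below assumes a whitespace-delimited file whose second field is the sample (group) name; that layout and all of the names in the example are assumptions for illustration, so adjust them to the actual file.

```python
from collections import Counter


def replicate_counts(lines):
    """Count rows per group, taking the group name from the second field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


# Hypothetical sample file rows: replicate name, group name, read file.
sample_lines = [
    "plusFe_rep1 plusFe plusFe_rep1.fastq.gz",
    "plusFe_rep2 plusFe plusFe_rep2.fastq.gz",
    "plusFe_rep3 plusFe plusFe_rep3.fastq.gz",
    "plusFe_rep4 plusFe plusFe_rep4.fastq.gz",
    "minusFe_rep1 minusFe minusFe_rep1.fastq.gz",
    "minusFe_rep2 minusFe minusFe_rep2.fastq.gz",
    "minusFe_rep3 minusFe minusFe_rep3.fastq.gz",
]

counts = replicate_counts(sample_lines)
# Groups with fewer than two rows have no replicates.
unreplicated = sorted(group for group, n in counts.items() if n < 2)
print(dict(counts), unreplicated)
```

A group name that is misspelled or padded differently on one row would show up here as an extra group with a single replicate.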

abakirbas commented 1 year ago

Hi Andrew, the Rlog.txt is attached. I do have replicates, and I list them in samplefile.txt. However, one of my groups has 4 replicates while the other has 3. Do you think that might be creating a problem?

Rlog.txt https://github.com/UCSC-LoweLab/tRAX/files/9963278/Rlog.txt

andrewdholmes commented 1 year ago

An uneven number of replicates across samples shouldn't cause a problem. I think the problem is in the pairfile, given the "incomplete final line found" messages. Check that it uses the names from the second field of the samplefile and that there isn't any extra content in it. If it looks good, I can take a look at the samplefile and pairfile and see if I can figure out what's going on.
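
The two checks described above can be sketched as follows. This assumes the pairfile holds two whitespace-separated sample names per line and that the samplefile's second field is the sample name; the function name and example contents are hypothetical. R reports "incomplete final line found" when a text file's last line lacks a newline terminator, so the sketch tests for that as well.

```python
def check_pairfile(sample_text, pair_text):
    """Return a list of human-readable problems found in the pair file."""
    # Valid names are taken from the second field of each samplefile row.
    groups = set()
    for line in sample_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            groups.add(fields[1])

    problems = []
    if pair_text and not pair_text.endswith("\n"):
        problems.append("pair file has no trailing newline")
    for number, line in enumerate(pair_text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            problems.append(
                "line %d: expected 2 fields, found %d" % (number, len(fields))
            )
        for name in fields:
            if name not in groups:
                problems.append(
                    "line %d: unknown sample name %r" % (number, name)
                )
    return problems


# Hypothetical inputs: the pair line uses a replicate name and lacks a newline.
sample_text = (
    "plusFe_rep1 plusFe plusFe_rep1.fastq.gz\n"
    "minusFe_rep1 minusFe minusFe_rep1.fastq.gz\n"
)
problems = check_pairfile(sample_text, "plusFe minusFe_rep1")
for problem in problems:
    print(problem)
```

An empty list from check_pairfile means only that these two checks passed; it does not rule out other causes.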

abakirbas commented 1 year ago

Andrew, I looked, and there was a spacing mistake in the samplefile, so I reran the pipeline. It still gave me the same error. I have attached the samplefile and pairfile. Thanks.

Attachments: samplefile.txt, samplepairs.txt

abakirbas commented 1 year ago

Andrew, do you have any updates?