Tiny-count error with Sequence-Based Counting Mode

vicgarcas commented 6 months ago

Hi, I'm having problems running tinyRNA on my samples. I first tried running it with a .gtf reference annotation and as far as I can tell it went ok. Then I tried running it in the "Sequence-Based Counting Mode" using only reference genomes, and I encountered problems. I suspect the problems are in my features.csv files, but I'm not sure and I don't know how to fix it.

To give a bit more information on what I'm doing / want to do, my goal is to:

1) run the samples, which come from C. elegans, against the reference genome of another organism, which was used to infect the worm. In this case I cannot use a reference annotation. Looking at the .sam files, it looks like bowtie ran correctly (I have a lot of aligned sequences in my infected samples and no aligned sequences in my control samples). However, I get an error when it reaches tiny-count:

("Error collecting output for parameter 'alignment_stats': ../../miniconda3/envs/tinyrna/lib/python3.10/site-packages/tiny/cwl/tools/tiny-count.cwl:116:7: Did not find output file with glob pattern: ['OrV_ERT54_reinfections_OrV_ref_alignment_2024-04-09_13-59-03_alignment_stats.csv'].", {})
[2024-04-09 13:59:34] WARNING [job tiny-count] completed permanentFail

I tried using the same features.csv that I used for the reference annotation (see below) and another features.csv, which you suggested in response to someone else's question on GitHub:

Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Overlap,Mismatches,Strand,5' End Nucleotide,Length
,,unk,,,1,Partial,0,both,all,all

Both of them failed at the same point, which as far as I can tell is before or at the beginning of tiny-count. Do you have any advice?

2) Another, less important issue: if possible, I'd also like to run the samples against the reference genome of the worm. This isn't really necessary given that it worked with the reference annotation, but in case it's an easy fix... I'm using the same features.csv file that I used for the reference annotation, which I think I got from here or from your lab's webpage:

Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Overlap,Mismatches,Strand,5' End Nucleotide,Length
Class,risiRNA,rRNA,,,1,Partial,,both,all,all
Class,miRNA,miRNA,,,2,"5' anchored, 0, 4",,sense,all,all
Class,piRNA,piRNA,,,2,"5' anchored, 0, 4",,sense,all,18-21
Class,CSR,CSR Class 22G,,,3,Partial,,antisense,G,21-23
Class,WAGO,WAGO Class 22G,,,3,Partial,,antisense,G,21-23
Class,ALG,ALG Class 26G,,,3,Partial,,antisense,G,26
Class,ERGO,ERGO Class 26G,,,3,Partial,,antisense,G,26
Class,ALG,ALG target 22G,,,3,Partial,,antisense,G,21-23
Class,ERGO,ERGO target 22G,,,3,Partial,,antisense,G,21-23
Class,unk,unclassifed 22G,,,3,Partial,,antisense,G,21-23
Class,unk,unclassifed siRNA,,,4,Partial,,both,all,21-24

As far as I can tell it goes ok here until it reaches deseq, because the tiny-count output files seem ok. However, in the deseq output files it looks like everything was classified as rRNA - I don't have any information regarding loci (see attached deseq .csv table).

OrV_ERT54_reinf_ref_genome_12h_2024-04-09_12-29-48_cond1_OrV_cond2_Control_deseq_table.csv

Operating System: Mac OS Ventura 13.4
tinyrna version: 1.5.0

Thank you in advance for your time and for developing this!

Victoria

AlexTate commented 6 months ago

Hi @vicgarcas,

For the first issue, can you please compress the /config and /logs subdirectories in your Run Directory and attach them here? For future reference, the exported terminal output from a failed run can also help us with troubleshooting. You can obtain it through the menus at the top of the screen when the terminal is the foreground window: Shell > Export Text As... . For now, let's start with /config and /logs and go from there.
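
If it helps, the two subdirectories can be bundled into a single archive with a few lines of Python. This is only a convenience sketch: the function name `bundle_run_dirs` and the output filename are made up for illustration, and the Run Directory path is a placeholder you would replace with your own.

```python
import zipfile
from pathlib import Path


def bundle_run_dirs(run_dir, out_zip, subdirs=("config", "logs")):
    """Zip the named subdirectories of a Run Directory into one archive.

    Hypothetical helper for illustration; it is not part of tinyRNA.
    """
    run_dir = Path(run_dir)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for sub in subdirs:
            # rglob walks the subdirectory recursively; directories are skipped
            for path in sorted((run_dir / sub).rglob("*")):
                if path.is_file():
                    # Store paths relative to the Run Directory, e.g. "logs/x.log"
                    zf.write(path, path.relative_to(run_dir).as_posix())
    return out_zip
```

For example, `bundle_run_dirs("path/to/run_directory", "run_bundle.zip")` would write `run_bundle.zip` to the current working directory, ready to attach.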

For the second issue, the DGE table looks like what I would expect for the run you described.

Since you didn't provide reference annotations (GFF/GTF), sequence-based counting was performed against your reference sequences. Conceptually this means that the sequences in your reference genome file act as "features" in tiny-count. It also means that Select for... and with value... aren't used to filter candidate features because the target of these selectors is GFF/GTF column 9.
Since the "sequences" in your reference genome are entire chromosomes, your alignments were counted against whole chromosome "features." This is why the Feature ID column of your DGE table has chromosome identifiers.
When alignments are evaluated using the rules in your Features Sheet, each alignment has only one candidate "feature" to select from: the chromosome to which it was aligned. Rules in the Features Sheet are evaluated by order of hierarchy. Since rule 1 has hierarchy=1, and all of its selectors essentially "allow all", and there are no other rules of hierarchy=1 to evaluate, this means that all reads are assigned to rule 1 and classified as rRNA.
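
To make the hierarchy step concrete, here is a minimal Python sketch of the idea. It is a simplified illustration and not tiny-count's actual implementation: the `Rule` class and the `matches` and `classify` functions are names invented for this example, and the Overlap and Mismatches selectors are left out entirely. It assumes each read has exactly one candidate feature (the chromosome), so only the rule selectors decide the outcome.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    classify_as: str
    hierarchy: int
    strand: str    # "sense", "antisense", or "both"
    nt5: str       # required 5' nucleotide, or "all"
    lengths: str   # "all", a single length such as "26", or a range such as "21-23"


def length_ok(spec, n):
    """Check a read length against a simplified length selector."""
    if spec == "all":
        return True
    lo, _, hi = spec.partition("-")
    return int(lo) <= n <= int(hi or lo)


def matches(rule, read_seq, read_strand):
    """True if the read passes every selector of the rule (simplified)."""
    return (rule.strand in ("both", read_strand)
            and rule.nt5 in ("all", read_seq[0])
            and length_ok(rule.lengths, len(read_seq)))


def classify(rules, read_seq, read_strand):
    """Keep only the matching rules with the smallest hierarchy value."""
    hits = [r for r in rules if matches(r, read_seq, read_strand)]
    if not hits:
        return []
    best = min(r.hierarchy for r in hits)
    return [r.classify_as for r in hits if r.hierarchy == best]


# Three rules taken from the Features Sheet in this thread
rules = [
    Rule("rRNA", 1, "both", "all", "all"),
    Rule("miRNA", 2, "sense", "all", "all"),
    Rule("CSR Class 22G", 3, "antisense", "G", "21-23"),
]

# A 22-nt antisense read starting with G also satisfies the hierarchy-3 rule,
# but the permissive hierarchy-1 rule wins.
print(classify(rules, "GAAATTTCCCGGGAAATTTCCC", "antisense"))  # ['rRNA']
```

Dropping the first rule from the list lets the same read fall through to `CSR Class 22G`, which is the behavior the sheet was designed for in feature-based counting.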

All of that being said, the Features Sheet that you attached was originally intended for feature-based counting so the results won't make sense here. It might be more useful to use specific sequences of interest in your reference sequences file, and then tailor your Features Sheet accordingly.

taimontgomery commented 6 months ago

Hi @vicgarcas, Another approach you might consider is to add the genome sequences of the organisms that you're using for infection to the C. elegans genome sequence file. Then when bowtie runs, it will capture reads from C. elegans and the other species simultaneously, so there is no need for separate runs. If you add the names and start and end coordinates of each of the other species' chromosomes as entries in your GFF/GTF file and specify a unique Type in column 3 of the GFF/GTF, tiny-count will tally those reads separately if they are specified as distinct rules in the Features Sheet. We'd be happy to provide further guidance if you choose to use this approach. This won't necessarily solve the issue but if you share the files Alex requested, I'm sure we can get to the bottom of it.
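
As a rough sketch of how those whole-chromosome entries could be generated, the Python below reads sequence names and lengths from FASTA text and writes one GFF3 line per sequence, spanning position 1 to the sequence length. Several things here are assumptions for illustration only: the function names, the `viral_genome` type label (the type field is column 3 in the GFF3 specification), the `manual` source label, the `+` strand, the `ID=` attribute, and the sequence names in the example. Please adapt them to your own files and Features Sheet.

```python
def fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the sequence ID,
            # which is how bowtie-style tools typically name references
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def whole_sequence_gff(lengths, feature_type="viral_genome",
                       source="manual", strand="+"):
    """Build one tab-delimited GFF3 line per sequence, covering 1..length."""
    rows = []
    for seqid, n in lengths.items():
        cols = [seqid, source, feature_type, "1", str(n),
                ".", strand, ".", f"ID={seqid}"]
        rows.append("\t".join(cols))
    return rows


# Placeholder records; real input would come from open("genome.fa")
fasta = [">OrV_RNA1 example record", "ACGTACGT", "ACGT", ">OrV_RNA2", "GGGCCC"]
for row in whole_sequence_gff(fasta_lengths(fasta)):
    print(row)
```

The printed lines could then be appended to the C. elegans annotation file, with a matching Type Filter value in the Features Sheet rule that should collect those reads.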

MontgomeryLab / tinyRNA

Tiny-count error with Sequence-Based Counting Mode #332