MontgomeryLab / tinyRNA

tinyRNA provides an all-in-one solution for precision analysis of sRNA-seq data. At the core of tinyRNA is a highly flexible counting utility, tiny-count, that allows for hierarchical assignment of reads to features based on positional information, extent of feature overlap, 5’ nucleotide, length, and strandedness.
GNU General Public License v3.0

RuntimeError: generator raised StopIteration #330

Closed kittyBS closed 8 months ago

kittyBS commented 8 months ago

Hello, I'm new to the field of Informatics, so please excuse my question. I am trying to use your tool to see the differences in RNA types in the samples I have previously analyzed for microRNA, but I cannot correct the error in the attachment. I am attaching the gtf file I edited, feature, sample and the error output to help you. Also I couldn't figure out how to create feature.csv. Thank you in advance for your interest.

The system I use is Ubuntu 20.04.6, and the tinyRNA version is v1.5.0.

tinyrna.zip

AlexTate commented 8 months ago

Hi @kittyBS, thank you for reaching out, and thank you for providing your configuration files and system info.

The file tinyrna-13652.txt tells me that at least one of the SAM files produced by bowtie is empty. Based on the timestamps, a possible cause for this is that a lot of reads are being lost during the fastp step. I would recommend:

  1. Use a web browser to look at the HTML reports in the fastp folder of your run directory and see if anything stands out
  2. Use a text editor to open the log files in the logs/fastp_* folder of your run directory. Do they say anything about an adapter?
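The log check in step 2 can be sketched in Python. This is a minimal sketch, not part of tinyRNA: it assumes the fastp logs contain summary blocks of the form "Read1 after filtering:" followed by "total reads: N", and the helper names (`read_counts`, `lost_all_reads`, `report`) and the glob pattern for the log files are hypothetical.

```python
import re
from pathlib import Path

# Matches a fastp summary header such as "Read1 after filtering:" together
# with the "total reads: N" line that follows it.
SUMMARY = re.compile(r"(Read[12]) (before|after) filtering:\s*total reads:\s*(\d+)")


def read_counts(log_text):
    """Return {(read, stage): total_reads} parsed from fastp log text."""
    return {
        (m.group(1), m.group(2)): int(m.group(3))
        for m in SUMMARY.finditer(log_text)
    }


def lost_all_reads(log_text):
    """True if any 'after filtering' block reports zero reads."""
    return any(
        stage == "after" and total == 0
        for (_, stage), total in read_counts(log_text).items()
    )


def report(run_dir):
    """Print the fastp log files (assumed location) that report zero reads."""
    for log_file in sorted(Path(run_dir).glob("logs/fastp_*/*")):
        if log_file.is_file() and lost_all_reads(log_file.read_text(errors="replace")):
            print(f"{log_file}: no reads left after filtering")
```

Calling `report("path/to/run_directory")` lists the logs in which every read was filtered out, which usually points to an adapter or quality setting worth revisiting.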
kittyBS commented 8 months ago

Hello,

Thank you for your quick reply. As you said, in some samples there are no reads left after adapter trimming. However, I could not find the specific adapter sequence in the log file:

    Read1 before filtering:
    total reads: 60538173
    total bases: 9139350206
    Q20 bases: 8008839684(87.6303%)
    Q30 bases: 7423335694(81.2239%)

    Read1 after filtering:
    total reads: 0
    total bases: 0
    Q20 bases: 0(-nan%)
    Q30 bases: 0(-nan%)

Can I create a bowtie index and print the SAM files myself? Or can I do the trimming myself?

Note: I solved the problem by providing the adapter sequence myself in run_config_template.yml, but this time, as far as I understand, I am having a memory problem, even though, if I am reading it correctly, I have enough memory on the system.

    free -h
                  total        used        free      shared  buff/cache   available
    Mem:          125Gi        11Gi        41Gi       137Mi        72Gi       113Gi
    Swap:         8.0Gi       8.0Gi        23Mi

tinyrna-13654.txt

AlexTate commented 8 months ago

> Can I create a bowtie index and print the sam files myself? Or can I do the trimming myself?

These three tasks are handled automatically during end-to-end runs according to the settings in your Run Config and Paths File. You can also run tiny-count by itself if you have SAM files that were prepared outside of the pipeline.

> Note: I solved the problem by providing the adapter sequence

Please double check that the problem has been resolved. The original error is no longer displayed, but tinyrna-13654.txt shows the pipeline stopping at an earlier step than in tinyrna-13652.txt. The tiny-collapse runtimes suggest that a lot of reads are still being lost during the fastp step. I suspect the original error would have been produced if the cluster workload manager (slurmstepd) had not interfered.

> as far as I understand, I am having a memory problem

> slurmstepd: error: Detected 1 oom_kill event

I agree with your conclusion here. The memory readings you gave should be more than adequate for mm10 and 4 samples. The last line of your log mentions slurmstepd, which belongs to your cluster workload manager, and I'm guessing that it's configured with a memory limit that isn't high enough. It isn't a component of tinyRNA, so you will need to speak with your system administrator about configuration for data-intensive jobs like running tinyRNA.

taimontgomery commented 8 months ago

Hi @kittyBS, To add to Alex's comments, you can reduce the memory usage substantially by running the samples sequentially rather than in parallel. To do so, all you need to do is change line 37 of run_config.yml to:

run_parallel: false
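For scripted setups, the same change can be sketched in Python. This is only an illustration: it assumes run_parallel is an unindented key in run_config.yml, the function name `set_run_parallel` is hypothetical, and it edits the text directly rather than parsing YAML so that comments and line order are preserved.

```python
import re


def set_run_parallel(config_text, enabled):
    """Return config_text with the unindented run_parallel key set to enabled."""
    value = "true" if enabled else "false"
    # Capture the key and the spacing after it, then replace only the old value.
    # Anything after the value (for example a trailing comment) is left untouched.
    pattern = re.compile(r"^(run_parallel:[ \t]*)[^\s#]+", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + value, config_text, count=1)
    if count == 0:
        raise ValueError("no run_parallel key found in the config text")
    return new_text
```

Reading run_config.yml into a string, passing it through `set_run_parallel(text, False)`, and writing the result back has the same effect as editing the line by hand.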

I also noticed in your features.csv, you specify "Class" with "any" value. Instead, based on your GTF you should specify each class you're interested in and give an identifier ("Classify as"). And you should also specify the desired "Overlap" between the reads and the features. For example,

Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Overlap,Mismatches,Strand,5' End Nucleotide,Length
Class,miRNA,miRNA_gene,,,1,nested,0,both,all,all
Class,lincRNA,lincRNA,,,2,nested,0,both,all,all

You should probably also include mature miRNAs, which I noticed are absent from your GTF, so you can compare the results to your previous analysis. If you're interested in the mature miRNAs, you can download a GFF3 file from miRBase and use the following rule within your features sheet to capture only mature miRNAs:

Type,miRNA,miRNA,,,1,5' anchored,0,Sense,all,all

A hierarchy value of 1 will ensure that miRNA reads are not counted toward other features. Let us know if you need any further clarification. And if you run into any more issues we're happy to help you out.

kittyBS commented 8 months ago

Hello, Sorry for the late reply, I was having a system problem. I'll change the parallel job execution and edit the GTF file, and I'd like to ask for some additional help regarding your comments on features.csv. Actually, I want to see the percentage of each kind of RNA found in my samples rather than a specific RNA type, so I avoided giving a specific RNA type. Will giving a specific class as you suggest achieve this? When I run it with "any", the class chart only contains "Unassigned" and "Unknown".

taimontgomery commented 8 months ago

Hi @kittyBS, It looks like you have 13 classes of ncRNA features, listed below. I suggest that you make a rule in your features sheet for each Class. Most of the files output by tinyRNA will distinguish each of the classes, but the total reads assigned to all of these features will also be reported by tiny-count in the alignment_stats.csv spreadsheet (see Total Assigned Reads). Scatter plots showing the difference in counts between your two conditions will be generated by tiny-plot, either with the classes indicated (see the output in the scatter_by_dge_class folder) or without (see the output in the scatter_by_dge folder).

But if you really don't care about distinguishing the classes and want to capture counts for all overlapping reads, you can have a single rule in your features.csv:

Select for...,with value...,Classify as...,Source Filter,Type Filter,Hierarchy,Overlap,Mismatches,Strand,5' End Nucleotide,Length
,,ncRNA,,,1,Partial,0,both,all,all

This will classify all features as ncRNA, which will accomplish what you want.

Classes: snRNA, lincRNA, miRNA, snoRNA, misc_RNA, rRNA, scaRNA, bidirectional_promoter_lncRNA, 3prime_overlapping_ncRNA, sRNA, scRNA, Mt_tRNA, Mt_rRNA
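The one-rule-per-class suggestion can be sketched as a short Python script that writes the rows of a features sheet. This is a minimal sketch under stated assumptions: the column header is the one quoted in this thread, while the choices to reuse each class name as the "Classify as..." value, to give miRNA hierarchy 1 and every other class hierarchy 2, and to use nested/0/both/all/all for the remaining columns are illustrative only. The helper names `build_rules` and `to_csv` are hypothetical.

```python
import csv
import io

HEADER = [
    "Select for...", "with value...", "Classify as...", "Source Filter",
    "Type Filter", "Hierarchy", "Overlap", "Mismatches", "Strand",
    "5' End Nucleotide", "Length",
]

# The 13 classes listed in this thread.
CLASSES = [
    "snRNA", "lincRNA", "miRNA", "snoRNA", "misc_RNA", "rRNA", "scaRNA",
    "bidirectional_promoter_lncRNA", "3prime_overlapping_ncRNA", "sRNA",
    "scRNA", "Mt_tRNA", "Mt_rRNA",
]


def build_rules(classes, priority=("miRNA",)):
    """Build one rule per class; classes in `priority` get hierarchy 1 (assumption)."""
    rows = []
    for name in classes:
        hierarchy = 1 if name in priority else 2
        # Source Filter and Type Filter are left empty, as in the examples above.
        rows.append(["Class", name, name, "", "", hierarchy,
                     "nested", 0, "both", "all", "all"])
    return rows


def to_csv(rows):
    """Serialize the header and rules as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing `to_csv(build_rules(CLASSES))` to features.csv gives a starting point; the Hierarchy and Overlap values should then be adjusted per class to match the analysis you have in mind.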

kittyBS commented 8 months ago

Hello, Thanks for your help. Based on your previous comments:

  1. I changed line 37 of run_config.yml to run_parallel: false.
  2. I added mature miRNAs to my GTF file.
  3. I changed the Class section of features.csv, trying both the individual ncRNA classes I have and a single ncRNA class.

As a result, the analysis finishes without any problems, but the class chart is still divided into UNASSIGNED and UNKNOWN. In the scatter_by_dge_class graph I can see the labels of the other ncRNAs. I am aware that the quality of the samples I used in the analysis is not very good. I plan to repeat the analysis with new samples and expect to get better results. Thank you for your quick and descriptive answers. My deepest thanks.

taimontgomery commented 8 months ago

Hi @kittyBS, Glad you were able to get it to work. The "Unknown" class will include all the features in your GTF, and "Unassigned" covers the reads that aligned to the genome but did not overlap with any of the features in your GTF. You can set the class name by changing the "Classify as..." value to whatever you choose, such as ncRNA. If you don't specify a name, it will default to "Unknown". See https://github.com/MontgomeryLab/tinyRNA/blob/master/doc/Configuration.md#features-sheet-details for details.