egaffo / circompara2

Improved bioinformatic pipeline to identify and quantify circRNA expression from RNA-seq data by combining multiple circRNA detection methods
Other
8 stars 0 forks source link

-ignoreGroupsWithoutExons & singularity #21

Open miguilab opened 7 months ago

miguilab commented 7 months ago

Hi egaffo, I'm trying to run circompara on my samples, but I got the following error:

finished circRNA detection from file samples/S1/processings/circRNAs/star_out/Chimeric.out.junction dcc_fix_strand.R -c samples/S1/processings/circRNAs/dcc/CircRNACount -d samples/S1/processings/circRNAs/dcc/CircCoordinates -o samples/S1/processings/circRNAs/dcc/strandedCircRNACount dcc_compare.R -l S1 -i samples/S1/processings/circRNAs/dcc/strandedCircRNACount -o circular_expression/circrna_collection/merged_samples_circrnas/dcc_compared.csv gtfToGenePred -infoOut=dbs/indexes/genePred.transcripts.info /home/miguilab/circular/BL/ncbi_dataset/data/GCA_927797965.1/genomic.gtf dbs/indexes/genomic.genePred no exons defined for group , feature gene (perhaps try -ignoreGroupsWithoutExons) scons: *** [dbs/indexes/genePred.transcripts.info] Error 255 scons: building terminated because of errors.

What could I do to solve it? I don't understand where I could use the option -ignoreGroupsWithoutExons. Further, do you think there's any way to run circompara2 through singularity? I've successfully used docker before, but now I don't have sudo privileges on my machine anymore.

Best,

Migui

egaffo commented 7 months ago

1) Something needs to be fixed with your annotation file. The gtfToGenePred program is used by circompara2 to generate a .genePred file, but it failed with your data. From the error, I guess there is a gene entry with no exons. You may spot the faulty line in your GTF and delete it, or make yourself the .genePred file by downloading the gtfToGenePred program (I used this one), run it with the option suggested in the error, prepend gene names to each line, and then pass the resulting file as a circompara2 variable in the vars.py file (GENEPRED='your_genepred_file.genePred'). These are the two commands circompara2 runs to prepare the proper genePred with gene names file ".genePred.wgn" (with human GTF annotations):

gtfToGenePred -infoOut=annotation/genePred.transcripts.info /T2T_CHM13/Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf annotation/Homo_sapiens-GCA_009914755.4-2022_07-genes.genePred
cut -f9 annotation/genePred.transcripts.info | grep -v geneName | paste - annotation/Homo_sapiens-GCA_009914755.4-2022_07-genes.genePred | sed "s_^\t\([^\t]*\)\t\(.*\)_\1\t\1\t\2_" > annotation/Homo_sapiens-GCA_009914755.4-2022_07-genes.genePred.wgn

2) I know that docker containers can be converted into singularity, but I've never done it. You could try to convert it and give some feedback about it.

miguilab commented 4 months ago

Hi egaffo, I solved the previous error by generating my own genePred file with gtfToGenePred, as you suggested. However, I am now encountering another error:

scons: *** [samples/S1/processings/circRNAs/tophat_out/accepted_hits.bam] Error 1 scons: building terminated because of errors.[2024-02-29 01:21:14] Searching for junctions via segment mapping [2024-02-29 02:03:19] Retrieving sequences for splices [2024-02-29 02:03:30] Indexing splices [2024-02-29 02:07:11] Mapping left_kept_reads.m2g_um_seg1 to genome segment_juncs with Bowtie (1/5) [2024-02-29 02:52:44] Mapping left_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie (2/5) [2024-02-29 03:38:16] Mapping left_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie (3/5) [2024-02-29 04:23:14] Mapping left_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie (4/5) [2024-02-29 05:04:31] Mapping left_kept_reads.m2g_um_seg5 to genome segment_juncs with Bowtie (5/5) [2024-02-29 05:41:26] Joining segment hits [FAILED] Error running 'long_spanning_reads':Loading fusions...done

scons: *** [samples/S1/processings/circRNAs/tophat_out/accepted_hits.bam] Error 1 scons: building terminated because of errors.

I just want to say that I considered that it may be a memory problem, but the pipeline with another species data is working fine (I have 128 Gb of RAM, however). So I guess something is off with the genome I am using now, but I cannot figure it out what. It worked fine for standard analyses like linear RNA-seq, ATAC-seq, ChIP-seq, etc...

Migui

egaffo commented 4 months ago

That is an error thrown by Tophat2. It could be a problem with the annotation file. You may google the error to see possible solutions.

miguilab commented 4 months ago

It says it could be memory problem. I am not sure this is my case. I'll try again.

miguilab commented 4 months ago

I also found this but I really cannot figure it out what is wrong with my GTF, it looks normal to me:

https://www.biostars.org/p/191006/

Do you think it would be possible to run circompara excluding bowtie?

egaffo commented 4 months ago

Sure, you can select any combination of the 9 methods supported (i.e., ciri, findcirc, circexplorer2_star, circexplorer2_bwa, circexplorer2_segemehl, circexplorer2_tophat, dcc, testrealign, circrna_finder) CirComPara2's default is to use the following 7 methods: circexplorer2_bwa,circexplorer2_segemehl,circexplorer2_star,circexplorer2_tophat,ciri,findcirc, and dcc. To exclude TopHat-Fusion, remove the circexplorer2_tophat entry from the (default) method option. In the vars.py file, add the following line (or replace it if you already have it): CIRCRNA_METHODS='circexplorer2_bwa,circexplorer2_segemehl,circexplorer2_star,ciri,findcirc,dcc'

miguilab commented 4 months ago

Hi egaffo,

I tried to run the pipeline with the line

CIRCRNA_METHODS='circexplorer2_bwa,circexplorer2_segemehl,circexplorer2_star,ciri,findcirc,dcc'

Now I have another different error:

[main] Real time: 5251.107 sec; CPU: 21231.339 sec CIRCexplorer2 parse -b samples/S1/processings/circRNAs/CIRCexplorer2_bwa/back_spliced_junction.bed -t BWA samples/S1/processings/circRNAs/bwa_out/S1_bwa.sam.gz | tee samples/S1/processings/circRNAs/CIRCexplorer2_bwa/CIRCexplorer2_bwa.log CIRCexplorer parameters: /circompara2/src/utils/bash/../../../bin/CIRCexplorer2 parse -b samples/S1/processings/circRNAs/CIRCexplorer2_bwa/back_spliced_junction.bed -t BWA samples/S1/processings/circRNAs/bwa_out/S1_bwa.sam.gz Start CIRCexplorer2 parse at 21:29:08 Start parsing fusion junctions from BWA... Converted 519719 fusion reads! End CIRCexplorer2 parse at 21:34:56 zcat samples/S1/processings/circRNAs/bwa_out/S1_bwa.sam.gz | samtools view -F 4 - | cut -f 1 | sort | uniq | wc -l > samples/S1/processings/circRNAs/bwa_out/BWA_mapped_reads_count.txt cd samples/S1/processings/circRNAs/CIRCexplorer2_bwa/annotate && CIRCexplorer2 annotate -r /data/ncbi_dataset/data/GCA_927797965.1/BraLan3.genPred -g /data/ncbi_dataset/data/GCA_927797965.1/GCA_927797965.1_BraLan3_genomic.fna -b /data/samples/S1/processings/circRNAs/CIRCexplorer2_bwa/back_spliced_junction.bed -o circularRNA_known.txt | tee CIRCexplorer2_bwa_annotate.log && cd /data Traceback (most recent call last): File "/circompara2/src/utils/bash/../../../bin/CIRCexplorer2", line 8, in sys.exit(main()) File "/circompara2/tools/lib/python2.7/site-packages/circ2/command_parse.py", line 51, in main CIRCexplorer parameters: /circompara2/src/utils/bash/../../../bin/CIRCexplorer2 annotate -r /data/ncbi_dataset/data/GCA_927797965.1/BraLan3.genPred -g /data/ncbi_dataset/data/GCA_927797965.1/GCA_927797965.1_BraLan3_genomic.fna -b /data/samples/S1/processings/circRNAs/CIRCexplorer2_bwa/back_spliced_junction.bed -o circularRNA_known.txt Start CIRCexplorer2 annotate at 21:39:22 Start to annotate fusion junctions... command=command_log, name='annotate') File "/circompara2/tools/lib/python2.7/site-packages/circ2/helper.py", line 38, in wrapper fn(*args) File "/circompara2/tools/lib/python2.7/site-packages/circ2/annotate.py", line 38, in annotate secondary_flag=options['--low-confidence']) File "/circompara2/tools/lib/python2.7/site-packages/circ2/annotate.py", line 50, in annotate_fusion genes, novel_genes, gene_info, chrom_info = parse_ref(ref_f, 1) File "/circompara2/tools/lib/python2.7/site-packages/circ2/parser.py", line 79, in parse_ref ends = [int(x) for x in line.split()[10].rstrip(',').split(',')] IndexError: list index out of range CIRCexplorer_compare.R -l S1 -i samples/S1/processings/circRNAs/CIRCexplorer2_bwa/annotate/circularRNA_known.txt -o circular_expression/circrna_collection/merged_samples_circrnas/CIRCexplorer2_bwa_compared.csv Error in file(file, "rt") : cannot open the connection Calls: ldply ... llply -> structure -> lapply -> FUN -> read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file 'samples/S1/processings/circRNAs/CIRCexplorer2_bwa/annotate/circularRNA_known.txt': No such file or directory Execution halted scons: *** [circular_expression/circrna_collection/merged_samples_circrnas/CIRCexplorer2_bwa_compared.csv] Error 1 scons: building terminated because of errors.

Do you think this is still due to gtf problems? Is there some other kind of parameter that I can change and/or tool that I can exclude to try to finish the pipeline?

Best,

Migui

egaffo commented 4 months ago

I think there is something wrong with your annotation. This error comes from the CIRCexplorer2 parser of the annotation file. Perhaps you can find some more error info in the CIRCexplorer2 log file (samples/S1/processings/circRNAs/CIRCexplorer2_bwa/annotate/CIRCexplorer2_bwa_annotate.log). From the error message, I guess one of the following formatting errors occurred:

The problem you had with TopHat2 suggests the second issue is more likely.

miguilab commented 4 months ago

Hi egaffo. Thanks for your help. I compared the genePred file of a species where the pipeline worked and did it everything by itself (zebrafish) and the genePred file that I made for the new species that is not working (amphioxus). Both of them have 10 columns. I'll paste the header.

zebrafish:

ENSDART00000159919 4 + 30402836 30403763 30403763 30403763 3 30402836,30403202,30403545, 30402893,30403350,30403763, ENSDART00000021139 4 + 1722898 1730920 1725131 1730113 6 1722898,1725129,1726487,1726677,1728599,1729975, 1722927,1725266,1726605,1726820,1728713,1730920,

amphioxus:

_unassigned_transcript_1 OV696686.1 - 1786 13403 2549 13403 13 1786,2819,3133,4516,5165,5595,5993,6836,7203,7679,7935,8211,13339, 2675,2953,3319,4639,5292,5780,6251,6952,7351,7837,8074,8520,13403, unassigned_transcript2 OV696686.1 - 17098 29136 18557 29054 10 17098,18874,19045,20259,20820,26393,26650,27000,27840,28914, 18720,18934,19151,20318,20976,26496,26760,27105,27924,29136,

However, for zebrafish I also found the genePred.wgn file generated automatically by the pipeline:

ENSDARG00000103202 ENSDART00000159919 4 + 30402836 30403763 30403763 30403763 3 30402836,30403202,30403545, 30402893,30403350,30403763, ENSDARG00000009657 ENSDART00000021139 4 + 1722898 1730920 1725131 1730113 6 1722898,1725129,1726487,1726677,1728599,1729975, 1722927,1725266,1726605,1726820,1728713,1730920, ENSDARG00000009657 ENSDART00000144356 4 + 1722902 1725794 1725794 1725794 2 1722902,1725129, 1722927,1725794,

I generated manually the genePred.wgn for amphioxus with the command you gave me and I obtained this:

_unassigned_transcript_1 unassigned_transcript_1 OV696686.1 - 1786 13403 2549 13403 13 1786,2819,3133,4516,5165,5595,5993,6836,7203,7679,7935,8211,13339, 2675,2953,3319,4639,5292,5780,6251,6952,7351,7837,8074,8520,13403, unassigned_transcript_2 unassigned_transcript2 OV696686.1 - 17098 29136 18557 29054 10 17098,18874,19045,20259,20820,26393,26650,27000,27840,28914, 18720,18934,19151,20318,20976,26496,26760,27105,27924,29136,

However, given the nature of the annotation of my file, I would prefer the "gene name" to be another column of the transcripts.info, solo I slightly changed the parameters of the cut command and I obtained this wgn file:

_BLAG_LOCUS25688 unassigned_transcript_1 OV696686.1 - 1786 13403 2549 13403 13 1786,2819,3133,4516,5165,5595,5993,6836,7203,7679,7935,8211,13339, 2675,2953,3319,4639,5292,5780,6251,6952,7351,7837,8074,8520,13403, BLAG_LOCUS1 unassigned_transcript2 OV696686.1 - 17098 29136 18557 29054 10 17098,18874,19045,20259,20820,26393,26650,27000,27840,28914, 18720,18934,19151,20318,20976,26496,26760,27105,27924,29136,

Can you please tell me if these two files look fine to you? Further its's not clear to me if in vars.py file I should state:

GENEPRED='your_genepred_file.genePred'

or

GENEPRED='your_genepred_file.genePred.wgn'

also given to the fact that my genePred files always have 10 columns (also when they work properly) and I don't find the wgn file to be generated in amphioxus, when the pipeline returns errors.

Thank you for your help,

Migui

egaffo commented 4 months ago

Your file with the BLAG_LOCUSXX should be fine. It has 11 fields. Use GENEPRED='your_genepred_file.genePred.wgn' (wgn stands for With Gene Names, just to distinguish the file from its original 10 fields version). However, the extension name is not important. If the GENEPRED variable is not set in the vars.py file, then CirComPara2 will generate it with exactly the commands I posted above (that is why you could find both .genePred and .genePred.wgn for zebrafish).

miguilab commented 3 months ago

Hi egaffo, I have some news!

The pipeline continued to fail even with the new genePred file. However, this time the error was different, involving stringtie and gtf. Clearly, we knew from the beginning that the problem was the gtf... but I couldn't get my head around it looked ok and especially because no features seemed to have no exons, referring to the first error I had:

no exons defined for group , feature gene (perhaps try -ignoreGroupsWithoutExons) scons: *** [dbs/indexes/genePred.transcripts.info] Error 255 scons: building terminated because of errors.

Finally after a thousand searches, unheeded threats to the computer and some crying I noticed that some transcript_id fields existed but were empty. Example:

gene_id "BLAG_LOCUS1"; transcript_id "";

I modified the gtf and relaunched everything, with all the circompare methods, and the pipeline worked correctly. I am leaving here the command I used to fix the gtf, in case someone else runs into the same error and may need it. Thank you very much for all the help during troubleshooting.

grep -P '\btranscript_id\s+"[^"]+"' file.gtf > fixed_file.gtf

Migui