bioinfo-biols / CIRIquant

circular RNA quantification tools
https://sourceforge.net/projects/ciri/files/CIRIquant
MIT License
27 stars 17 forks source link

Generate empty gtf file #29

Closed mrb20045 closed 3 years ago

mrb20045 commented 3 years ago

Hi I ran the CIRIquant and everythings goes well but it generate empty gtf file (only has header as follow

Sample: test

Total_Reads: 8767472

Mapped_Reads: 8257908

Circular_Reads: 0

version: 1.1.2).

My Code: CIRIquant -1 trim-ERR2076987_1_val_1.fq.gz -2 trim-ERR2076987_2_val_2.fq.gz --config CIRIquant.yml -o ./test -p test --circ S1.ciri --tool CIRI2 -t 70

CIRIquant.yml name: Genome tools: bwa: /usr/bin/bwa hisat2: /usr/bin/hisat2 stringtie: /usr/bin/stringtie samtools: /usr/bin/samtools

reference: fasta: /home/mrb/MRB/Genome/Sheep/Ensembl/Index/Ovis_aries.Oar_v3.1.77.fa gtf: /home/mrb/MRB/Genome/Sheep/Ensembl/GTF/Ovis_aries.Oar_v3.1.103.gtf bwa_index: /home/mrb/MRB/Genome/Sheep/Ensembl/Index/Ovis_aries.Oar_v3.1.77.fa hisat_index: /home/mrb/MRB/Genome/Sheep/Ensembl/Index/Ovis_aries.Oar_v3.1.77.fa

S1.ciri (change to txt file to drag here S1.ciri.txt )

Log file: [Wed 2021-03-31 12:52:58] [INFO ] Input reads: trim-ERR2076987_1_val_1.fq.gz,trim-ERR2076987_2_val_2.fq.gz [Wed 2021-03-31 12:52:58] [INFO ] Library type: unstranded [Wed 2021-03-31 12:52:58] [INFO ] Output directory: /home/mrb/MRB/RSeq_All_v6/Data_Example/circRNA/test, Output prefix: test [Wed 2021-03-31 12:52:58] [INFO ] Config: Genome Loaded [Wed 2021-03-31 12:52:58] [INFO ] 80 CPU cores availble, using 70 [Wed 2021-03-31 12:52:58] [INFO ] Align RNA-seq reads to reference genome .. Time loading forward index: 00:00:14 Time loading reference: 00:00:00 Multiseed full-index search: 00:06:00 4383736 reads; of these: 4383736 (100.00%) were paired; of these: 299106 (6.82%) aligned concordantly 0 times 3652446 (83.32%) aligned concordantly exactly 1 time 432184 (9.86%) aligned concordantly >1 times

299106 pairs aligned concordantly 0 times; of these:
  23120 (7.73%) aligned discordantly 1 time
----
275986 pairs aligned 0 times concordantly or discordantly; of these:
  551972 mates make up the pairs; of these:
    400491 (72.56%) aligned 0 times
    133063 (24.11%) aligned exactly 1 time
    18418 (3.34%) aligned >1 times

95.43% overall alignment rate Time searching: 00:06:01 Overall time: 00:06:15 [bam_sort_core] merging from 0 files and 70 in-memory blocks... [Wed 2021-03-31 12:59:53] [INFO ] Estimate gene abundance .. [Wed 2021-03-31 13:00:33] [INFO ] Using predicted circRNA results from CIRI2: /home/mrb/MRB/RSeq_All_v6/Data_Example/circRNA/circRNA_Results/3_CIRI2/Sample_1/CIRI_output/S1.ciri [Wed 2021-03-31 13:00:33] [INFO ] Extract circular sequence [Wed 2021-03-31 13:01:03] [INFO ] Building circular index .. Settings: Output files: "/home/mrb/MRB/RSeq_All_v6/Data_Example/circRNA/test/circ/test_index..ht2" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 4 (one in 16) FTable chars: 10 Strings: unpacked Local offset rate: 3 (one in 8) Local fTable chars: 6 Local sequence length: 57344 Local sequence overlap between two consecutive indexes: 1024 Endianness: little Actual local endianness: little Sanity checking: disabled Assertions: disabled Random seed: 0 Sizeofs: void:8, int:4, long:8, size_t:8 Input files DNA, FASTA: /home/mrb/MRB/RSeq_All_v6/Data_Example/circRNA/test/circ/test_index.fa Reading reference sizes Time reading reference sizes: 00:00:00 Calculating joined length Writing header Reserving space for joined string Joining reference sequences Time to join reference sequences: 00:00:00 Time to read SNPs and splice sites: 00:00:00 Using parameters --bmax 568893 --dcv 1024 Doing ahead-of-time memory usage test Passed! Constructing with these parameters: --bmax 568893 --dcv 1024 Constructing suffix-array element generator Building DifferenceCoverSample Building sPrime Building sPrimeOrder V-Sorting samples V-Sorting samples time: 00:01:05 Allocating rank array Ranking v-sort output Ranking v-sort output time: 00:00:00 Invoking Larsson-Sadakane on ranks Invoking Larsson-Sadakane on ranks time: 00:00:00 Sanity-checking and returning Building samples Reserving space for 12 sample suffixes Generating random suffixes QSorting 12 sample offsets, eliminating duplicates QSorting sample offsets, eliminating duplicates time: 00:00:00 Multikey QSorting 12 samples (Using difference cover) Multikey QSorting samples time: 00:00:00 Calculating bucket sizes Splitting and merging Splitting and merging time: 00:00:00 Split 1, merged 6; iterating... Splitting and merging Splitting and merging time: 00:00:00 Avg bucket size: 379261 (target: 568892) Converting suffix-array elements to index image Allocating ftab, absorbFtab Entering GFM loop Getting block 1 of 8 Reserving size (568893) for bucket 1 Getting block 2 of 8 Reserving size (568893) for bucket 2 Getting block 3 of 8 Reserving size (568893) for bucket 3 Calculating Z arrays for bucket 1 Getting block 4 of 8 Reserving size (568893) for bucket 4 Calculating Z arrays for bucket 2 Getting block 5 of 8 Entering block accumulator loop for bucket 1: Calculating Z arrays for bucket 4 Reserving size (568893) for bucket 5 Getting block 6 of 8 Reserving size (568893) for bucket 6 Entering block accumulator loop for bucket 2: Getting block 7 of 8 Entering block accumulator loop for bucket 4: Reserving size (568893) for bucket 7 Calculating Z arrays for bucket 5 Calculating Z arrays for bucket 3 Calculating Z arrays for bucket 6 Getting block 8 of 8 Reserving size (568893) for bucket 8 Calculating Z arrays for bucket 7 Entering block accumulator loop for bucket 5: Entering block accumulator loop for bucket 3: Entering block accumulator loop for bucket 6: Calculating Z arrays for bucket 8 Entering block accumulator loop for bucket 7: Entering block accumulator loop for bucket 8: bucket 8: 10% bucket 1: 10% bucket 2: 10% bucket 4: 10% bucket 3: 10% bucket 7: 10% bucket 6: 10% bucket 5: 10% bucket 8: 20% bucket 1: 20% bucket 2: 20% bucket 4: 20% bucket 3: 20% bucket 7: 20% bucket 8: 30% bucket 1: 30% bucket 6: 20% bucket 5: 20% bucket 2: 30% bucket 8: 40% bucket 1: 40% bucket 4: 30% bucket 3: 30% bucket 7: 30% bucket 6: 30% bucket 2: 40% bucket 8: 50% bucket 5: 30% bucket 1: 50% bucket 4: 40% bucket 3: 40% bucket 7: 40% bucket 8: 60% bucket 2: 50% bucket 1: 60% bucket 6: 40% bucket 5: 40% bucket 8: 70% bucket 4: 50% bucket 3: 50% bucket 1: 70% bucket 7: 50% bucket 2: 60% bucket 8: 80% bucket 6: 50% bucket 1: 80% bucket 4: 60% bucket 5: 50% bucket 3: 60% bucket 2: 70% bucket 7: 60% bucket 8: 90% bucket 1: 90% bucket 4: 70% bucket 8: 100% Sorting block of length 565092 for bucket 8 (Using difference cover) bucket 2: 80% bucket 6: 60% bucket 3: 70% bucket 5: 60% bucket 7: 70% bucket 1: 100% Sorting block of length 558897 for bucket 1 (Using difference cover) bucket 2: 90% bucket 4: 80% bucket 3: 80% bucket 6: 70% bucket 7: 80% bucket 5: 70% bucket 2: 100% Sorting block of length 250942 for bucket 2 (Using difference cover) bucket 4: 90% bucket 3: 90% bucket 7: 90% bucket 6: 80% bucket 5: 80% bucket 3: 100% Sorting block of length 341167 for bucket 3 (Using difference cover) bucket 4: 100% Sorting block of length 320619 for bucket 4 (Using difference cover) bucket 7: 100% Sorting block of length 266301 for bucket 7 (Using difference cover) bucket 6: 90% bucket 5: 90% bucket 6: 100% Sorting block of length 458788 for bucket 6 (Using difference cover) bucket 5: 100% Sorting block of length 272283 for bucket 5 (Using difference cover) Sorting block time: 00:00:01 Returning block of 250943 for bucket 2 Sorting block time: 00:00:01 Returning block of 266302 for bucket 7 Sorting block time: 00:00:01 Returning block of 272284 for bucket 5 Sorting block time: 00:00:02 Returning block of 320620 for bucket 4 Sorting block time: 00:00:02 Returning block of 341168 for bucket 3 Sorting block time: 00:00:02 Returning block of 458789 for bucket 6 Sorting block time: 00:00:03 Returning block of 558898 for bucket 1 Sorting block time: 00:00:03 Returning block of 565093 for bucket 8 Exited GFM loop fchr[A]: 0 fchr[C]: 846358 fchr[G]: 1500536 fchr[T]: 2169082 fchr[$]: 3034096 Exiting GFM::buildToDisk() Returning from initFromVector Wrote 5210376 bytes to primary GFM file: /home/mrb/MRB/RSeq_All_v6/Data_Example/circRNA/test/circ/test_index.1.ht2 Wrote 758532 bytes to secondary GFM file: /home/mrb/MRB/RSeq_All_v6/Data_Example/circRNA/test/circ/test_index.2.ht2 Re-opening _in1 and _in2 as input streams Returning from GFM constructor Returning from initFromVector Wrote 1829845 bytes to primary GFM file: /home/mrb/MRB/RSeq_All_v6/Data_Example/circRNA/test/circ/test_index.5.ht2 Wrote 765828 bytes to secondary GFM file: /home/mrb/MRB/RSeq_All_v6/Data_Example/circRNA/test/circ/test_index.6.ht2 Re-opening _in5 and _in5 as input streams Returning from HierEbwt constructor Headers: len: 3034096 gbwtLen: 3034097 nodes: 3034097 sz: 758524 gbwtSz: 758525 lineRate: 6 offRate: 4 offMask: 0xfffffff0 ftabChars: 10 eftabLen: 0 eftabSz: 0 ftabLen: 1048577 ftabSz: 4194308 offsLen: 189632 offsSz: 758528 lineSz: 64 sideSz: 64 sideGbwtSz: 48 sideGbwtLen: 192 numSides: 15803 numLines: 15803 gbwtTotLen: 1011392 gbwtTotSz: 1011392 reverse: 0 linearFM: Yes Total time for call to driver() for forward index: 00:01:10 [Wed 2021-03-31 13:02:13] [INFO ] De novo alignment for circular RNAs .. 4383736 reads; of these: 4383736 (100.00%) were paired; of these: 4361810 (99.50%) aligned concordantly 0 times 4958 (0.11%) aligned concordantly exactly 1 time 16968 (0.39%) aligned concordantly >1 times

4361810 pairs aligned concordantly 0 times; of these:
  63 (0.00%) aligned discordantly 1 time
----
4361747 pairs aligned 0 times concordantly or discordantly; of these:
  8723494 mates make up the pairs; of these:
    8692404 (99.64%) aligned 0 times
    6241 (0.07%) aligned exactly 1 time
    24849 (0.28%) aligned >1 times

0.86% overall alignment rate [bam_sort_core] merging from 0 files and 70 in-memory blocks... [Wed 2021-03-31 13:08:03] [INFO ] Detecting reads containing Back-splicing signals [Wed 2021-03-31 13:08:04] [INFO ] Detecting FSJ reads from genome alignment file [Wed 2021-03-31 13:08:57] [INFO ] Merge bsj and fsj results [Wed 2021-03-31 13:08:57] [INFO ] Loading annotation gtf .. [Wed 2021-03-31 13:09:03] [INFO ] Output circRNA expression values [Wed 2021-03-31 13:09:04] [INFO ] circRNA Expression profile: test.gtf [Wed 2021-03-31 13:09:04] [INFO ] Finished!

Kevinzjy commented 3 years ago

Hi @mrb20045 , what's your sequencing read length? By the way, could you try running CIRIquant without --circ and --tool options, and reduce the -t number to something smaller like 16?

mrb20045 commented 3 years ago

Hi @mrb20045 , what's your sequencing read length? By the way, could you try running CIRIquant without --circ and --tool options, and reduce the -t number to something smaller like 16?

Sequence length after trimming 100-125

I tried with -t 8 as well as without --circ and --tool options (with bed file) and the same problem was occurred.

Kevinzjy commented 3 years ago

Well, that means that all your BSJ reads have unsolid alignment pattern or could probably align to somewhere else with better score. So no BSJ reads were reported by CIRIquant.

Could you take a closer look at the BSJ reads (IDs in the last column of CIRI2 output), and show me their alignment results in align/test.sorted.bam and circ/test_denovo.sorted.bam ? It would help me determine what's happening.

mrb20045 commented 3 years ago

test_denovo.sorted

Please check the files that contain the mapped reads based on the IDs in the last column of CIRI2 output. It is merit to mention that I ran CIRIquant with big data and the same problem was occurred too.

test.sorted.txt

test_denovo.sorted.txt

Kevinzjy commented 3 years ago

I noticed that many BSJ reads in CIRI2 can be perfectly aligned to the reference genome. Although CIRI2 reported these reads as BSJ from BWA-mem alignment results, CIRIquant will treat them as false positive, which is quite reasonable considering that BWA-mem is not specifically designed to handle spliced alignment of RNA-seq reads.

ERR2076987.3959871  163 13  63166030    1   60M3925N55M =   63170014    184 ATCGGGACCGGCTGCCATCTTAGTCGAGGGACTGAAGAGTAGCCGCCGCCCCGTGTCCCGGTAACATGCATTTAACCGTGGCCTTCTGGAGACAGCGCCTTAATCCAAAGAAGTG BBCCBGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGEGEGGGBGGGGGG AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:115    YS:i:0  YT:Z:CP XS:A:+  NH:i:2
ERR2076987.3959871  419 13  63166030    1   60M3925N55M =   63166089    3981    ATCGGGACCGGCTGCCATCTTAGTCGAGGGACTGAAGAGTAGCCGCCGCCCCGTGTCCCGGTAACATGCATTTAACCGTGGCCTTCTGGAGACAGCGCCTTAATCCAAAGAAGTG BBCCBGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGEGEGGGBGGGGGG AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:115    YS:i:0  YT:Z:CP XS:A:+  NH:i:2

Besides, HISAT2 weirdly reported a lot of CIRI2 BSJ reads as unmapped in test.sorted.bam. How's the quality of your reference genome? Are there many repetitive sequences or gaps?.

mrb20045 commented 3 years ago

First of all Thanks for your support.

Quality of the genome is good (sheep genome). I fell that something's goes wrong as I got the same problem with real data (about 100 million paired reads). Do you have any test data to try it?

Or do you have any suggestion to fix it?

Regards

On Thu, Apr 1, 2021, 07:42 Jinyang Zhang @.***> wrote:

I noticed that many BSJ reads in CIRI2 can be perfectly aligned to the reference genome. Although CIRI2 reported these reads as BSJ from BWA-mem alignment results, CIRIquant will treat them as false positive, which is a reasonable considering that BWA-mem is not specifically designed to handle spliced alignment of RNA-seq reads.

ERR2076987.3959871 163 13 63166030 1 60M3925N55M = 63170014 184 ATCGGGACCGGCTGCCATCTTAGTCGAGGGACTGAAGAGTAGCCGCCGCCCCGTGTCCCGGTAACATGCATTTAACCGTGGCCTTCTGGAGACAGCGCCTTAATCCAAAGAAGTG BBCCBGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGEGEGGGBGGGGGG AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:115 YS:i:0 YT:Z:CP XS:A:+ NH:i:2 ERR2076987.3959871 419 13 63166030 1 60M3925N55M = 63166089 3981 ATCGGGACCGGCTGCCATCTTAGTCGAGGGACTGAAGAGTAGCCGCCGCCCCGTGTCCCGGTAACATGCATTTAACCGTGGCCTTCTGGAGACAGCGCCTTAATCCAAAGAAGTG BBCCBGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGEGEGGGBGGGGGG AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:115 YS:i:0 YT:Z:CP XS:A:+ NH:i:2

Besides, HISAT2 weirdly reported a lot of CIRI2 BSJ reads as unmapped in test.sorted.bam. How's the quality of your reference genome? Are there many repetitive sequences or gaps?.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bioinfo-biols/CIRIquant/issues/29#issuecomment-811606114, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACLTVHXTSOY77UMR7Z4KKM3TGPQC3ANCNFSM42EC5GCA .

Kevinzjy commented 3 years ago

Hi @mrb20045 , you can try the test dataset from CIRIquant documentation.

Besides, I also noticed that a lot of your read pairs have completely overlapping sequences after trimming, which means the library insert size is smaller than your read length, and the pseudo-circular alignment strategy in CIRIquant might not be able to handle these reads correctly.

Under these circumstances, I would recommend you to merge overlapping read pairs with FLASH, then run CIRI2 in single-end mode for circRNAs detection.

mrb20045 commented 3 years ago

Hi @mrb20045 , you can try the test dataset from CIRIquant documentation.

Besides, I also noticed that a lot of your read pairs have completely overlapping sequences after trimming, which means the library insert size is smaller than your read length, and the pseudo-circular alignment strategy in CIRIquant might not be able to handle these reads correctly.

Under these circumstances, I would recommend you to merge overlapping read pairs with FLASH, then run CIRI2 in single-end mode for circRNAs detection.

Dear Kevinzjy

I ran CIRI2 pipeline as follow: CIRI-full.jar Pipeline -1 trim-ileum_female_OAR_RI_AF2_ERS2107162_1.fastq.gz -2 trim-ileum_female_OAR_RI_AF2_ERS2107162_2.fastq.gz -r Ovis_aries.Oar_v3.1.77.fa -a Ovis_aries.Oar_v3.1.103.gtf -d Sample_2 -t 20 it generate the circRNAs as follow:

S2.ciri.txt

moreover CIRCexplorer2 pipeline was used for this data and circRNAs were identified as they had acceptable overlap with CIRI2.

Then CIRIquant was applied as follow: CIRIquant -t 20 -1 trim-ileum_female_OAR_RI_AF2_ERS2107162_1.fastq.gz -2 trim-ileum_female_OAR_RI_AF2_ERS2107162_2.fastq.gz --config configure.yml -o ./testmain -p testmain

configure.yml
name: hg19 tools: bwa: /usr/bin/bwa hisat2: /usr/bin/hisat2 stringtie: /usr/bin/stringtie samtools: /usr/bin/samtools

reference: fasta: /home/mrb/MRB/Genome/Sheep/Ensembl/Index/Ovis_aries.Oar_v3.1.77.fa gtf: /home/mrb/MRB/Genome/Sheep/Ensembl/GTF/Ovis_aries.Oar_v3.1.103.gtf bwa_index: /home/mrb/MRB/Genome/Sheep/Ensembl/Index/Ovis_aries.Oar_v3.1.77.fa hisat_index: /home/mrb/MRB/Genome/Sheep/Ensembl/Index/Ovis_aries.Oar_v3.1.77.fa

It generate circRNAs as follow:

S2_CIRIquant.ciri.txt

And expression file was generated as follow: S2_CIRIquant.gtf.txt

Unfortunately it has not any overlap with CIRI2 and CIRCexplorer2. Could you please check the results. It at least have to show some overlap circRNAs with CIRI2.

Kevinzjy commented 3 years ago

Hi @mrb20045 , I checked your results, and found a weird issue. The following lines demonstrate same BSJ reads that assigned to different circRNA coordinates in CIRIquant and CIRI-full:

zhangjy@Z4G4:code grep ERR2077263.3471536 ./*.txt
./S2_CIRIquant.ciri.txt:1:235439|236760 1       235439  236760  3       2_3_1   121     0.047   exon    ENSOARG00000017642,     + ERR2077270.3879443,ERR2077263.3471536,ERR2077261.2285123,
./S2.ciri.txt:1:268490|270677   1       268490  270677  3       3_3_0   112     0.051   exon    ENSOARG00020000069,     +       ERR2077263.3471536,ERR2077270.3879443,ERR2077261.2285123,

Considering that CIRI-full and CIRIquant use the same command when running CIRI2, you should carefully check the version of reference genome used for building bwa_index and hisat_index. I would recommend you re-generate these index and run CIRIquant again, and check your input and output files carefully.

mrb20045 commented 3 years ago

Hi @mrb20045 , I checked your results, and found a weird issue. The following lines demonstrate same BSJ reads that assigned to different circRNA coordinates in CIRIquant and CIRI-full:

zhangjy@Z4G4:code grep ERR2077263.3471536 ./*.txt
./S2_CIRIquant.ciri.txt:1:235439|236760 1       235439  236760  3       2_3_1   121     0.047   exon    ENSOARG00000017642,     + ERR2077270.3879443,ERR2077263.3471536,ERR2077261.2285123,
./S2.ciri.txt:1:268490|270677   1       268490  270677  3       3_3_0   112     0.051   exon    ENSOARG00020000069,     +       ERR2077263.3471536,ERR2077270.3879443,ERR2077261.2285123,

Considering that CIRI-full and CIRIquant use the same command when running CIRI2, you should carefully check the version of reference genome used for building bwa_index and hisat_index. I would recommend you re-generate these index and run CIRIquant again, and check your input and output files carefully.

Thanks so much. I think that it was my problem as I used a new version of the genome. CIRIquant could not generate gtf file as I used bed file based on the previous version of the genome with different coordinates. Actually the circRNAs coordinates were predicted based on the old genome and th new version was used in CIRIquant. Thanks