bioinfo-biols / CIRIquant

circular RNA quantification tools
https://sourceforge.net/projects/ciri/files/CIRIquant
MIT License
27 stars 17 forks source link

Error in 'Detecting FSJ reads from genome alignment file' #40

Open stefanerb89 opened 1 year ago

stefanerb89 commented 1 year ago

Hello,

first sample was done very well, but my second sample is stucking. My shell output is the following:

-1 SID16126_S10_L001_R1_001.fastq.gz -2 SID16126_S10_L001_R2_001.fastq.gz [Wed 2022-07-20 15:44:37] [INFO ] Input reads: SID16126_S10_L001_R1_001.fastq.gz,SID16126_S10_L001_R2_001.fastq.gz [Wed 2022-07-20 15:44:37] [INFO ] Library type: unstranded [Wed 2022-07-20 15:44:37] [INFO ] Output directory: /home/serbe204/Documents/ciri2, Output prefix: SID16126_S10_L001_R1_001.fastq.gz [Wed 2022-07-20 15:44:37] [INFO ] Config: hg38 Loaded [Wed 2022-07-20 15:44:37] [INFO ] 32 CPU cores availble, using 4 [Wed 2022-07-20 15:44:37] [INFO ] Align RNA-seq reads to reference genome .. [Wed 2022-07-20 16:19:24] [INFO ] Estimate gene abundance .. [Wed 2022-07-20 16:23:31] [INFO ] No circRNA information provided, run CIRI2 for junction site prediction .. [Wed 2022-07-20 16:23:31] [INFO ] Running BWA-mem mapping candidate reads .. [Wed 2022-07-20 18:09:05] [INFO ] Running CIRI2 for circRNA detection .. [Wed 2022-07-20 18:41:14] [INFO ] Extract circular sequence [Wed 2022-07-20 18:41:20] [100% ] [##################################################] [Wed 2022-07-20 18:41:20] [INFO ] Building circular index .. [Wed 2022-07-20 18:41:29] [INFO ] De novo alignment for circular RNAs .. [Wed 2022-07-20 19:16:26] [INFO ] Detecting reads containing Back-splicing signals [Wed 2022-07-20 19:29:59] [INFO ] Detecting FSJ reads from genome alignment file [Wed 2022-07-20 20:24:23] [INFO ] Merge bsj and fsj results [Wed 2022-07-20 20:24:23] [INFO ] Loading annotation gtf .. [Wed 2022-07-20 20:24:31] [INFO ] Output circRNA expression values [Wed 2022-07-20 20:24:32] [WARNING] chrom of contig "KI270711.1" not in annotation gtf, please check [Wed 2022-07-20 20:24:41] [INFO ] circRNA Expression profile: SID16126_S10_L001_R1_001.fastq.gz.gtf [Wed 2022-07-20 20:24:41] [INFO ] Finished! -1 SID16127_S11_L001_R1_001.fastq.gz -2 SID16127_S11_L001_R2_001.fastq.gz [Wed 2022-07-20 20:24:42] [INFO ] Input reads: SID16127_S11_L001_R1_001.fastq.gz,SID16127_S11_L001_R2_001.fastq.gz [Wed 2022-07-20 20:24:42] [INFO ] Library type: unstranded [Wed 2022-07-20 20:24:42] [INFO ] Output directory: /home/serbe204/Documents/ciri2, Output prefix: SID16127_S11_L001_R1_001.fastq.gz [Wed 2022-07-20 20:24:42] [INFO ] Config: hg38 Loaded [Wed 2022-07-20 20:24:42] [INFO ] 32 CPU cores availble, using 4 [Wed 2022-07-20 20:24:42] [INFO ] Align RNA-seq reads to reference genome .. [Wed 2022-07-20 20:59:58] [INFO ] Estimate gene abundance .. [Wed 2022-07-20 21:03:35] [INFO ] No circRNA information provided, run CIRI2 for junction site prediction .. [Wed 2022-07-20 21:03:35] [INFO ] Running BWA-mem mapping candidate reads .. [Wed 2022-07-20 22:27:34] [INFO ] Running CIRI2 for circRNA detection .. [Wed 2022-07-20 22:55:27] [INFO ] Extract circular sequence [Wed 2022-07-20 22:55:33] [100% ] [##################################################] [Wed 2022-07-20 22:55:33] [INFO ] Building circular index .. [Wed 2022-07-20 22:55:59] [INFO ] De novo alignment for circular RNAs .. [Wed 2022-07-20 23:25:22] [INFO ] Detecting reads containing Back-splicing signals [Wed 2022-07-20 23:36:15] [INFO ] Detecting FSJ reads from genome alignment file

Error.log below

[Wed 2022-07-20 20:24:42] [INFO ] Input reads: SID16127_S11_L001_R1_001.fastq.gz,SID16127_S11_L001_R2_001.fastq.gz [Wed 2022-07-20 20:24:42] [INFO ] Library type: unstranded [Wed 2022-07-20 20:24:42] [INFO ] Output directory: /home/serbe204/Documents/ciri2, Output prefix: SID16127_S11_L001_R1_001.fastq.gz [Wed 2022-07-20 20:24:42] [INFO ] Config: hg38 Loaded [Wed 2022-07-20 20:24:42] [INFO ] 32 CPU cores availble, using 4 [Wed 2022-07-20 20:24:42] [INFO ] Align RNA-seq reads to reference genome .. Time loading forward index: 00:00:01 Time loading reference: 00:00:00 Multiseed full-index search: 00:31:20 112163215 reads; of these: 112163215 (100.00%) were paired; of these: 58311206 (51.99%) aligned concordantly 0 times 51156655 (45.61%) aligned concordantly exactly 1 time 2695354 (2.40%) aligned concordantly >1 times

58311206 pairs aligned concordantly 0 times; of these:
  146839 (0.25%) aligned discordantly 1 time
----
58164367 pairs aligned 0 times concordantly or discordantly; of these:
  116328734 mates make up the pairs; of these:
    110775545 (95.23%) aligned 0 times
    5267282 (4.53%) aligned exactly 1 time
    285907 (0.25%) aligned >1 times

50.62% overall alignment rate Time searching: 00:31:20 Overall time: 00:31:21 [bam_sort_core] merging from 104 files and 4 in-memory blocks... [Wed 2022-07-20 20:59:58] [INFO ] Estimate gene abundance .. [Wed 2022-07-20 21:03:35] [INFO ] No circRNA information provided, run CIRI2 for junction site prediction .. [Wed 2022-07-20 21:03:35] [INFO ] Running BWA-mem mapping candidate reads .. [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 298616 sequences (40000253 bp)... [M::process] read 298588 sequences (40000147 bp)... [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (3, 60825, 27, 2) [M::mem_pestat] skip orientation FF as there are not enough pairs [M::mem_pestat] analyzing insert size distribution for orientation FR...

...

[M::mem_process_seqs] Processed 23008 reads in 2.423 CPU sec, 0.569 real sec [main] Version: 0.7.17-r1188 [main] CMD: /home/serbe204/anaconda3/envs/CIRI/bin/bwa mem -t 4 -T 19 /home/serbe204/Documents/GRCh38_Ensemble/Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa /home/serbe204/Documents/RNASeq-iPS-ADAR2_KD/SID16127_S11_L001_R1_001.fastq.gz /home/serbe204/Documents/RNASeq-iPS-ADAR2_KD/SID16127_S11_L001_R2_001.fastq.gz [main] Real time: 5038.921 sec; CPU: 20444.558 sec [Wed 2022-07-20 22:27:34] [INFO ] Running CIRI2 for circRNA detection .. [Wed Jul 20 22:27:34 2022] CIRI begins running [Wed Jul 20 22:27:34 2022] Loading reference [Wed Jul 20 22:27:45 2022] Requesting system to split SAM into 4 pieces [Wed Jul 20 22:27:34 2022] CIRI begins running [Wed Jul 20 22:27:34 2022] Loading reference [Wed Jul 20 22:27:45 2022] Requesting system to split SAM into 4 pieces Divided SAM sizes: /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz_unmapped.samaa 30448813406 /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz_unmapped.samab 30448813406 /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz_unmapped.samac 30448813406 /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz_unmapped.samad 30448813403 SAM was divided successfully. First read of divided SAM files: SID16127_S11_L001_R1_001.fastq.gz_unmapped.samab: A00722:78:HLC7TDSX2:1:1377:6632:13041 SID16127_S11_L001_R1_001.fastq.gz_unmapped.samac: A00722:78:HLC7TDSX2:1:1675:31403:35117 SID16127_S11_L001_R1_001.fastq.gz_unmapped.samad: A00722:78:HLC7TDSX2:1:2377:32235:16579 SID16127_S11_L001_R1_001.fastq.gz_unmapped.samaa: A00722:78:HLC7TDSX2:1:1101:1181:1000 First reads were recorded successfully. [Wed Jul 20 22:30:13 2022] First scanning [Wed Jul 20 22:30:13 2022] First scanning Worker 1 begins to scan SID16127_S11_L001_R1_001.fastq.gz_unmapped.samac. Worker 2 begins to scan SID16127_S11_L001_R1_001.fastq.gz_unmapped.samab. Worker 3 begins to scan SID16127_S11_L001_R1_001.fastq.gz_unmapped.samaa. Worker 4 begins to scan SID16127_S11_L001_R1_001.fastq.gz_unmapped.samad. Worker 1 finished reporting. Worker 2 finished reporting. Worker 3 finished reporting. Worker 4 finished reporting. Candidate reads with splicing signals: 12959 Candidate reads with PEM signals: 12103 Candidate circRNAs found: 3036 [Wed Jul 20 22:41:25 2022] Second scanning Candidate reads with splicing signals: 12959 Candidate reads with PEM signals: 12103 Candidate circRNAs found: 3036 [Wed Jul 20 22:41:25 2022] Second scanning Worker 5 begins to scan SID16127_S11_L001_R1_001.fastq.gz_unmapped.samac. Worker 6 begins to scan SID16127_S11_L001_R1_001.fastq.gz_unmapped.samab. Worker 7 begins to scan SID16127_S11_L001_R1_001.fastq.gz_unmapped.samaa. Worker 8 begins to scan SID16127_S11_L001_R1_001.fastq.gz_unmapped.samad. Worker 5 finished reporting. Worker 6 finished reporting. Worker 7 finished reporting. Worker 8 finished reporting. [Wed Jul 20 22:53:54 2022] Extracting info from temporary files Additional candidate reads found: 2846 Additional candidate reads with PEM signals: 2165 [Wed Jul 20 22:55:20 2022] Summarizing Number of circular RNAs found: 2821 [Wed Jul 20 22:53:54 2022] Extracting info from temporary files Additional candidate reads found: 2846 Additional candidate reads with PEM signals: 2165 [Wed Jul 20 22:55:20 2022] Summarizing Number of circular RNAs found: 2821 [Wed Jul 20 22:55:25 2022] CIRI finished its work. Please see output file /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz.ciri for detail. [Wed Jul 20 22:55:25 2022] CIRI finished its work. Please see output file /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz.ciri for detail.

[Wed 2022-07-20 22:55:27] [INFO ] Extract circular sequence [Wed 2022-07-20 22:55:33] [INFO ] Building circular index .. Settings: Output files: "/home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz_index..ht2" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 4 (one in 16) FTable chars: 10 Strings: unpacked Local offset rate: 3 (one in 8) Local fTable chars: 6 Local sequence length: 57344 Local sequence overlap between two consecutive indexes: 1024 Endianness: little Actual local endianness: little Sanity checking: disabled Assertions: disabled Random seed: 0 Sizeofs: void:8, int:4, long:8, size_t:8 Input files DNA, FASTA: /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz_index.fa Reading reference sizes Time reading reference sizes: 00:00:00 Calculating joined length Writing header Reserving space for joined string Joining reference sequences Time to join reference sequences: 00:00:00 Time to read SNPs and splice sites: 00:00:00 Using parameters --bmax 1914635 --dcv 1024 Doing ahead-of-time memory usage test Passed! Constructing with these parameters: --bmax 1914635 --dcv 1024 Constructing suffix-array element generator Building DifferenceCoverSample Building sPrime Building sPrimeOrder V-Sorting samples V-Sorting samples time: 00:00:01 Allocating rank array Ranking v-sort output Ranking v-sort output time: 00:00:00 Invoking Larsson-Sadakane on ranks Invoking Larsson-Sadakane on ranks time: 00:00:00 Sanity-checking and returning Building samples Reserving space for 44 sample suffixes Generating random suffixes QSorting 44 sample offsets, eliminating duplicates QSorting sample offsets, eliminating duplicates time: 00:00:00 Multikey QSorting 44 samples (Using difference cover) Multikey QSorting samples time: 00:00:00 Calculating bucket sizes Splitting and merging Splitting and merging time: 00:00:00 Split 6, merged 21; iterating... Splitting and merging Splitting and merging time: 00:00:00 Split 4, merged 2; iterating... Splitting and merging Splitting and merging time: 00:00:00 Split 2, merged 3; iterating... Splitting and merging Splitting and merging time: 00:00:00 Split 1, merged 2; iterating... Splitting and merging Splitting and merging time: 00:00:00 Split 1, merged 1; iterating... Avg bucket size: 1.36152e+06 (target: 1914634) Converting suffix-array elements to index image Allocating ftab, absorbFtab Entering GFM loop Getting block 1 of 30 Reserving size (1914635) for bucket 1 Getting block 2 of 30 Getting block 3 of 30 Reserving size (1914635) for bucket 3 Reserving size (1914635) for bucket 2 Calculating Z arrays for bucket 1 Getting block 4 of 30 Calculating Z arrays for bucket 2 Entering block accumulator loop for bucket 1: Reserving size (1914635) for bucket 4 Calculating Z arrays for bucket 3 Entering block accumulator loop for bucket 2: Calculating Z arrays for bucket 4 Entering block accumulator loop for bucket 3: Entering block accumulator loop for bucket 4: bucket 1: 10% bucket 2: 10% bucket 4: 10% bucket 3: 10% bucket 1: 20% bucket 2: 20% bucket 4: 20% bucket 3: 20% bucket 1: 30% bucket 2: 30% bucket 4: 30% bucket 3: 30% bucket 1: 40% bucket 2: 40% bucket 4: 40% bucket 1: 50% bucket 3: 40% bucket 2: 50% bucket 4: 50% bucket 1: 60% bucket 2: 60% bucket 3: 50% bucket 4: 60% bucket 1: 70% bucket 2: 70% bucket 4: 70% bucket 3: 60% bucket 1: 80% bucket 2: 80% bucket 4: 80% bucket 1: 90% bucket 3: 70% bucket 2: 90% bucket 4: 90% bucket 1: 100% Sorting block of length 1461040 for bucket 1 (Using difference cover) bucket 3: 80% bucket 2: 100% Sorting block of length 830206 for bucket 2 (Using difference cover) bucket 4: 100% Sorting block of length 1198780 for bucket 4 (Using difference cover) bucket 3: 90% bucket 3: 100% Sorting block of length 1508491 for bucket 3 (Using difference cover) Sorting block time: 00:00:00 Returning block of 830207 for bucket 2 Getting block 5 of 30 Reserving size (1914635) for bucket 5 Calculating Z arrays for bucket 5 Entering block accumulator loop for bucket 5: bucket 5: 10% bucket 5: 20% bucket 5: 30% bucket 5: 40% bucket 5: 50% bucket 5: 60% bucket 5: 70% bucket 5: 80% bucket 5: 90% Sorting block time: 00:00:01 Returning block of 1198781 for bucket 4 Getting block 6 of 30 Reserving size (1914635) for bucket 6 Calculating Z arrays for bucket 6 Entering block accumulator loop for bucket 6: bucket 5: 100% Sorting block of length 1730753 for bucket 5 (Using difference cover) bucket 6: 10% bucket 6: 20% bucket 6: 30% bucket 6: 40% Sorting block time: 00:00:01 Returning block of 1461041 for bucket 1 Getting block 7 of 30 Reserving size (1914635) for bucket 7 Calculating Z arrays for bucket 7 Entering block accumulator loop for bucket 7: bucket 6: 50% bucket 7: 10% bucket 6: 60% bucket 7: 20% bucket 6: 70% Sorting block time: 00:00:01 Returning block of 1508492 for bucket 3 bucket 7: 30% Getting block 8 of 30 Reserving size (1914635) for bucket 8 Calculating Z arrays for bucket 8 Entering block accumulator loop for bucket 8: bucket 6: 80% bucket 7: 40% bucket 6: 90% bucket 8: 10% bucket 7: 50% bucket 6: 100%

...

bucket 30: 100% Sorting block of length 788849 for bucket 30 (Using difference cover) bucket 29: 90% bucket 29: 100% Sorting block of length 1408830 for bucket 29 (Using difference cover) Sorting block time: 00:00:01 Returning block of 1288403 for bucket 28 Sorting block time: 00:00:01 Returning block of 788850 for bucket 30 Sorting block time: 00:00:01 Returning block of 1408831 for bucket 29 Exited GFM loop fchr[A]: 0 fchr[C]: 12577814 fchr[G]: 20416140 fchr[T]: 28255216 fchr[$]: 40845540 Exiting GFM::buildToDisk() Returning from initFromVector Wrote 19213931 bytes to primary GFM file: /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz_index.1.ht2 Wrote 10211392 bytes to secondary GFM file: /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz_index.2.ht2 Re-opening _in1 and _in2 as input streams Returning from GFM constructor Returning from initFromVector Wrote 34288037 bytes to primary GFM file: /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz_index.5.ht2 Wrote 10304266 bytes to secondary GFM file: /home/serbe204/Documents/ciri2/circ/SID16127_S11_L001_R1_001.fastq.gz_index.6.ht2 Re-opening _in5 and _in5 as input streams Returning from HGFM constructor Headers: len: 40845540 gbwtLen: 40845541 nodes: 40845541 sz: 10211385 gbwtSz: 10211386 lineRate: 6 offRate: 4 offMask: 0xfffffff0 ftabChars: 10 eftabLen: 0 eftabSz: 0 ftabLen: 1048577 ftabSz: 4194308 offsLen: 2552847 offsSz: 10211388 lineSz: 64 sideSz: 64 sideGbwtSz: 48 sideGbwtLen: 192 numSides: 212738 numLines: 212738 gbwtTotLen: 13615232 gbwtTotSz: 13615232 reverse: 0 linearFM: Yes Total time for call to driver() for forward index: 00:00:26 [Wed 2022-07-20 22:55:59] [INFO ] De novo alignment for circular RNAs .. 112163215 reads; of these: 112163215 (100.00%) were paired; of these: 78357984 (69.86%) aligned concordantly 0 times 7892485 (7.04%) aligned concordantly exactly 1 time 25912746 (23.10%) aligned concordantly >1 times

78357984 pairs aligned concordantly 0 times; of these:
  2036 (0.00%) aligned discordantly 1 time
----
78355948 pairs aligned 0 times concordantly or discordantly; of these:
  156711896 mates make up the pairs; of these:
    153229929 (97.78%) aligned 0 times
    346892 (0.22%) aligned exactly 1 time
    3135075 (2.00%) aligned >1 times

31.69% overall alignment rate [bam_sort_core] merging from 164 files and 4 in-memory blocks... [Wed 2022-07-20 23:25:22] [INFO ] Detecting reads containing Back-splicing signals [Wed 2022-07-20 23:36:15] [INFO ] Detecting FSJ reads from genome alignment file

Kevinzjy commented 1 year ago

Hi @stefanerb89 ,

CIRIquant seems to be stuck at some BSJ sites that have an enormous number of FSJ reads. If you need the FSJ read number, you have to wait for it to finish ...

However, if FSJ reads are not necessary, the bioinfo-biols/CIRIquant@b29976193ddd2205281ac4b2880fa19082bae67b include the "--no-fsj" option to disable FSJ counting. You could use it to skip the FSJ estimation step. You will have to clone the repository and install it from the source code.

bj-w commented 3 months ago

I'm also having this issue on some samples. Is it possible to check the progress of this process i.e. check it hasn't crashed?