Xinglab / isoCirc

isoCirc
GNU General Public License v3.0
10 stars 4 forks source link

Short-read error correction feature #6

Closed sidizhao closed 2 years ago

sidizhao commented 2 years ago

Hi there,

I've been able to run isoCirc with isocirc -t 1 $PATH/FAQ07459_pass_a4f2108a.fastq $PATH/all-chrs.fa $PATH/hg38_ref_all.gtf $PATH/annotation.bed $PATH/isocirc_output and get intended results. However, when I tried to add the --short parameter to the run command, it seems to me that there is either a really long run time, or I'm not running it correctly.

I have 2 sets of pair-end short-read sequencing data, and here's the command I used:

isocirc -t 1 --short-read $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq $PATH/FAQ07459_pass_a4f2108a.fastq $PATH/all-chrs.fa $PATH/hg38_ref_all.gtf $PATH/annotation.bed $PATH/isocirc_outputP_short_read

When I look into the log of the run, it shows this (path edited and omitted similar lines for simplicity):

Matplotlib created a temporary config/cache directory at /tmp/matplotlib-f2h0a7nc because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
== 00:47:41-Sep-07-2021 == [check_dependencies] Checking dependencies ...
== 00:47:41-Sep-07-2021 == [check_dependencies] Checking dependencies done!
== 00:47:41-Sep-07-2021 == [Error-correction] Hybrid error correction using $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq ...
== 00:47:41-Sep-07-2021 == [LoRDEC] lordec-correct -2 $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq -i $PATH/FAQ07459_pass_a4f2108a.fastq -o $PATH/isocirc_output_short_read/long_corrected.fa -k 21 -s 3 -T 1
-2
$PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq
-i
$PATH/FAQ07459_pass_a4f2108a.fastq
-o
$PATH/long_corrected.fa
-k
21
-s
3
-T
1
illumina: $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq $PATH/HCT116_Illumina_TruSeq_R1.fastq_multi_k21_s3.h5 pacbioFile: $PATH/FAQ07459_pass_a4f2108a.fastq
kmer_len: 21 solid_kmer_thr: 3
max_trials: 5 max_error_rate: 0.4 max_branch: 200
abundance_max: 2147483647
Cannot access the graph file for reference reads: $PATH/HCT116_Illumina_TruSeq_R1.fastq_multi_k21_s3.h5
bRefGraph: 0
bRefSeq: 1
creating the graph from file(s): $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq

[DSK: counting kmers                     ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [  28,   28,   76] MB 
[DSK: Pass 1/1, Step 1: partitioning     ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [  46,   46,   76] MB 
[DSK: Pass 1/1, Step 1: partitioning     ]  1    %   elapsed:   0 min 20 sec   remaining:  32 min 17 sec   cpu:  99.2 %   mem: [ 494,  494,  494] MB 
[DSK: Pass 1/1, Step 1: partitioning     ]  2    %   elapsed:   0 min 39 sec   remaining:  32 min 8  sec   cpu:  99.4 %   mem: [ 575,  575,  575] MB 
[DSK: Pass 1/1, Step 1: partitioning     ]  3    %   elapsed:   0 min 59 sec   remaining:  31 min 47 sec   cpu:  99.3 %   mem: [ 575,  575,  575] MB 
...
[DSK: Pass 1/1, Step 2: counting kmers   ]  50.4 %   elapsed:  16 min 47 sec   remaining:  16 min 30 sec   cpu:  99.2 %   mem: [  96,  608,  608] MB 
[DSK: Pass 1/1, Step 2: counting kmers   ]  53.5 %   elapsed:  17 min 56 sec   remaining:  15 min 36 sec   cpu:  99.2 %   mem: [4298, 4298, 4328] MB 
[DSK: Pass 1/1, Step 2: counting kmers   ]  53.5 %   elapsed:  17 min 56 sec   remaining:  15 min 36 sec   cpu:  99.2 %   mem: [4298, 4298, 4328] MB 
...
[DSK: nb solid kmers found : 156051682   ]  101  %   elapsed:  36 min 56 sec   remaining:   0 min 0  sec   cpu:  99.4 %   mem: [1378, 5928, 5960] MB 

[Building BooPHF]  0.1  %   elapsed:   0 min 0  sec   remaining:   4 min 34 sec
[Building BooPHF]  0.2  %   elapsed:   0 min 0  sec   remaining:   3 min 32 sec
[Building BooPHF]  0.3  %   elapsed:   0 min 1  sec   remaining:   3 min 57 sec
...

[MPHF: populate                          ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [1583, 1583, 5960] MB 
[MPHF: populate                          ]  2    %   elapsed:   0 min 1  sec   remaining:   0 min 56 sec   cpu:  99.1 %   mem: [1583, 1583, 5960] MB 
[MPHF: populate                          ]  3    %   elapsed:   0 min 2  sec   remaining:   0 min 55 sec   cpu: 100.0 %   mem: [1583, 1583, 5960] MB 
...

[Bloom: read solid kmers                 ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [1910, 1910, 5960] MB 
[Bloom: read solid kmers                 ]  2    %   elapsed:   0 min 1  sec   remaining:   1 min 12 sec   cpu: 100.0 %   mem: [1910, 1910, 5960] MB 
[Bloom: read solid kmers                 ]  3    %   elapsed:   0 min 2  sec   remaining:   1 min 6  sec   cpu: 100.0 %   mem: [1910, 1910, 5960] MB 
...

[Debloom: finalization                   ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [2225, 2225, 5960] MB 
[Debloom: finalization                   ]  2    %   elapsed:   0 min 0  sec   remaining:   0 min 23 sec   cpu: 100.0 %   mem: [2273, 2273, 5960] MB 
[Debloom: finalization                   ]  3    %   elapsed:   0 min 1  sec   remaining:   0 min 22 sec   cpu:  98.6 %   mem: [2298, 2298, 5960] MB 
...

[Debloom: save                           ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [2241, 2241, 5960] MB 
[Debloom: save                           ]  2    %   elapsed:   0 min 1  sec   remaining:   0 min 58 sec   cpu:  99.2 %   mem: [2241, 2241, 5960] MB 
[Debloom: save                           ]  3    %   elapsed:   0 min 2  sec   remaining:   0 min 57 sec   cpu:  99.4 %   mem: [2241, 2241, 5960] MB 
...

[Graph: build branching nodes            ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu:  -1.0 %   mem: [1980, 1980, 5960] MB 
[Graph: build branching nodes            ]  2    %   elapsed:   0 min 8  sec   remaining:   6 min 56 sec   cpu:  99.8 %   mem: [1980, 1980, 5960] MB 
[Graph: build branching nodes            ]  3    %   elapsed:   0 min 13 sec   remaining:   6 min 51 sec   cpu:  99.8 %   mem: [1980, 1980, 5960] MB 
...

[Graph: nb branching found : 28171957    ]  100  %   elapsed:   7 min 11 sec   remaining:   0 min 0  sec   cpu:  99.8 %   mem: [2410, 2410, 5960] MB 
!!! file present : $PATH/TruSeq_R1.fastq_multi_k21_s3.h5
graph created

It seems to me that only one of the short-read files were used, and there hasn't been any more lines printed to the output file. It still says the job is running, but I don't see it proceeding to the next step (finding TRFs).

I also have a question. With this line illumina: $PATH/TruSeq_R1.fastq,$PATH/TruSeq_R2.fastq,$PATH/New_England_R1.fastq,$PATH/New_England_R2.fastq $PATH/HCT116_Illumina_TruSeq_R1.fastq_multi_k21_s3.h5 pacbioFile: $PATH/FAQ07459_pass_a4f2108a.fastq kmer_len: 21 solid_kmer_thr: 3 it looks like it's taking my long read data as PacBio generated. Mine is actually nanopore. Is there anywhere I can specify that?

Really appreciate your help! Please advise me on what I should do next.

yangao07 commented 2 years ago

The error correction step by lordec is very slow, it may need some more time to finish. Lordec only use the first short-read file to name the generated graph file (.h5), but it should contain all the short-read sequence information.

The 'pacbioFile' message was from the lordec program . This error correction program was initially designed for pacbio long reads, but it also shows good performance on nanopore reads. There isn't any option to specify the read type for lordec, so basically you are doing the right thing.

Yan

sidizhao commented 2 years ago

Thank you so much. As of now, it took more than 21 hours but the program is moving forward. It’s looking for TRFs now. I think it’s functioning well.