GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0

Preparing input files for TAMA #68

Closed mictadlo closed 2 years ago

mictadlo commented 2 years ago

Hi, we have two Iso-Seq libraries from 2017 that we would like to run through TAMA. I used the commands below to filter the files, but I am not sure whether the isoseq3 tag and isoseq3 dedup steps are needed. The file size drops dramatically after each step. Is that normal, am I doing something wrong, or is there a problem with the libraries?

Also, at which step should I merge the two libraries, and which tool is recommended for that?

> ccs m54105_171201_020331.subreads.bam m54105_171201_020331.ccs.bam --report-file m54105_171201_020331_ccs_report.txt --report-json m54105_171201_020331_ccs_report.json  --metrics-json m54105_171201_020331_ccs_metric_report.txt --skip-polish --min-passes 1 --num-threads 8
> samtools index ${samp}.ccs.bam
> cat m54105_171201_020331_ccs_report.txt
ZMWs input               : 206410        

ZMWs pass filters        : 145259 (70.37%)
ZMWs fail filters        : 61151 (29.63%)
ZMWs shortcut filters    : 0 (0.000%)

ZMWs with tandem repeats : 56 (0.092%)

Exclusive failed counts
Below SNR threshold      : 0 (0.000%)
Median length filter     : 0 (0.000%)
Lacking full passes      : 54532 (89.18%)
Heteroduplex insertions  : 463 (0.757%)
Coverage drops           : 26 (0.043%)
Insufficient draft cov   : 5161 (8.440%)
Draft too different      : 0 (0.000%)
Draft generation error   : 957 (1.565%)
Draft above --max-length : 0 (0.000%)
Draft below --min-length : 12 (0.020%)
Reads failed polishing   : 0 (0.000%)
Empty coverage windows   : 0 (0.000%)
CCS did not converge     : 0 (0.000%)
CCS below minimum RQ     : 0 (0.000%)
Unknown error            : 0 (0.000%)

Additional passing metrics
ZMWs missing adapters    : 0 (0.000%)
> lima --isoseq --dump-clips -j 8 m54105_171201_020331.ccs.bam primers.fasta m54105_171201_020331.ccs-lima.bam
> isoseq3 tag m54105_171201_020331.ccs-lima.5p--3p.bam m54105_171201_020331.ccs-lima.5p--3p.flt.bam --design T-8U-12B
> isoseq3 refine m54105_171201_020331.ccs-lima.5p--3p.flt.bam primers.fasta m54105_171201_020331.ccs-lima.5p--3p.fltnc.bam --require-polya
> isoseq3 dedup m54105_171201_020331.ccs-lima.5p--3p.fltnc.bam  m54105_171201_020331.ccs-lima.5p--3p.fltnc.dedup.bam --log-level INFO
| 20220202 04:31:50.329 | INFO | Output DEDUP FLTNC bam: m54105_171201_020331.ccs-lima.5p--3p.fltnc.dedup.bam
| 20220202 04:31:50.402 | INFO | Parse BAM                : (378) 69ms 875us
| 20220202 04:31:50.403 | INFO | Parse BAM                : 75ms 800us
| 20220202 04:31:50.403 | INFO | Peak RSS                 : 0.00718689 GB
| 20220202 04:31:50.405 | INFO | Compare                  : (0) 2ms 315us
| 20220202 04:31:50.420 | INFO | Compare                  : 17ms 754us
| 20220202 04:31:50.420 | INFO | Duplicates (molecules)   : 0(0)
| 20220202 04:31:50.424 | INFO | Consensus                : 4ms 2us
| 20220202 04:31:50.551 | INFO | Write BAM                : 126ms 170us
| 20220202 04:31:50.557 | INFO | Write fasta              : 6ms 355us
| 20220202 04:31:50.557 | INFO | Duplicate molecules      : 0
| 20220202 04:31:50.557 | INFO | Unique molecules         : 189
| 20220202 04:31:50.557 | INFO | Run Time                 : 230ms 140us
| 20220202 04:31:50.557 | INFO | CPU Time                 : 98ms 484us
| 20220202 04:31:50.557 | INFO | Peak RSS                 : 0.0103531 GB
> ls -haltr
-rw-rw---- 1 lorencm default 995K Feb  1 14:55 m54105_171201_020331.ccs.bam.pbi
-rw-rw---- 1 lorencm default 108M Feb  1 14:55 m54105_171201_020331.ccs.bam
-rw-rw---- 1 lorencm default  74M Feb  1 14:55 m54105_171201_020331_ccs_metric_report.txt
-rw-rw---- 1 lorencm default  881 Feb  1 14:56 m54105_171201_020331_ccs_report.txt
-rw-rw---- 1 lorencm default 3.4K Feb  1 14:56 m54105_171201_020331_ccs_report.json
...
-rw-rw---- 1 lorencm default  911 Feb  2 14:27 m54105_171201_020331.ccs-lima.lima.summary
-rw-rw---- 1 lorencm default  26M Feb  2 14:27 m54105_171201_020331.ccs-lima.lima.report
-rw-rw---- 1 lorencm default   87 Feb  2 14:27 m54105_171201_020331.ccs-lima.lima.counts
-rw-rw---- 1 lorencm default 653K Feb  2 14:27 m54105_171201_020331.ccs-lima.lima.clips
-rw-rw---- 1 lorencm default  599 Feb  2 14:27 m54105_171201_020331.ccs-lima.json
-rw-rw---- 1 lorencm default 2.3K Feb  2 14:27 m54105_171201_020331.ccs-lima.consensusreadset.xml
-rw-rw---- 1 lorencm default 2.3K Feb  2 14:27 m54105_171201_020331.ccs-lima.5p--3p.consensusreadset.xml
-rw-rw---- 1 lorencm default  42K Feb  2 14:27 m54105_171201_020331.ccs-lima.5p--3p.bam.pbi
-rw-rw---- 1 lorencm default 3.8M Feb  2 14:27 m54105_171201_020331.ccs-lima.5p--3p.bam
-rw-rw---- 1 lorencm default  13K Feb  2 14:27 m54105_171201_020331.ccs-lima.5p--3p.flt.bam.pbi
-rw-rw---- 1 lorencm default 1.1M Feb  2 14:27 m54105_171201_020331.ccs-lima.5p--3p.flt.bam
-rw-rw---- 1 lorencm default  11K Feb  2 14:31 m54105_171201_020331.ccs-lima.5p--3p.fltnc.report.csv
-rw-rw---- 1 lorencm default   70 Feb  2 14:31 m54105_171201_020331.ccs-lima.5p--3p.fltnc.filter_summary.json
-rw-rw---- 1 lorencm default 1.7K Feb  2 14:31 m54105_171201_020331.ccs-lima.5p--3p.fltnc.consensusreadset.xml
-rw-rw---- 1 lorencm default 2.2K Feb  2 14:31 m54105_171201_020331.ccs-lima.5p--3p.fltnc.bam.pbi
-rw-rw---- 1 lorencm default 166K Feb  2 14:31 m54105_171201_020331.ccs-lima.5p--3p.fltnc.bam
-rw-rw---- 1 lorencm default 467K Feb  2 14:31 m54105_171201_020331.ccs-lima.5p--3p.fltnc.dedup.fasta
-rw-rw---- 1 lorencm default 1.4K Feb  2 14:31 m54105_171201_020331.ccs-lima.5p--3p.fltnc.dedup.bam.pbi
-rw-rw---- 1 lorencm default 154K Feb  2 14:31 m54105_171201_020331.ccs-lima.5p--3p.fltnc.dedup.bam
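
To put numbers on the reduction, here is a small sketch that computes retention from the counts printed in the logs above. File sizes are only a rough proxy, so it uses the ZMW and molecule counts instead. The variable names and the `pct` helper are illustrative and not part of any PacBio tool.

```shell
#!/usr/bin/env bash
# Retention between steps, using only counts printed in the logs above.
set -euo pipefail

zmws_input=206410      # "ZMWs input" in the ccs report
zmws_pass=145259       # "ZMWs pass filters" in the ccs report
unique_molecules=189   # "Unique molecules" in the isoseq3 dedup log

# pct PART WHOLE -> percentage with two decimals (hypothetical helper)
pct() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.2f", 100 * part / whole }'
}

echo "CCS pass rate:          $(pct "$zmws_pass" "$zmws_input")%"
echo "dedup molecules vs CCS: $(pct "$unique_molecules" "$zmws_pass")%"
```

This prints 70.37% for the CCS step, matching the report, and 0.13% for the molecules left after dedup relative to the CCS reads. Counts for the intermediate lima, tag and refine outputs are not in the logs; something like `samtools view -c file.bam` on each BAM should give them.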

Thank you in advance,

Michal

GenomeRIK commented 2 years ago

Hi Michal,

It seems PacBio has changed its terminology again, so I am not sure this is exactly equivalent, but I believe you would take the "fltnc.dedup.fasta" files and map them to the appropriate genome using Minimap2. You would then feed the resulting BAM files into TAMA Collapse, and the output of that into TAMA Merge.
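
As a rough sketch of that order of steps (not a tested recipe for this data set), the commands could look like the following. The genome path, the `tama/` install location, the `lib1`/`lib2` names, the thread count, the `no_cap` setting and the `1,1,1` merge priorities are all assumptions to adjust. The sketch passes a sorted SAM file to TAMA Collapse rather than a BAM file, and it only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch: map deduplicated FLNC reads, then run TAMA Collapse and TAMA Merge.
# Every path and name below is a placeholder, not taken from the thread.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print a command and, outside dry-run mode, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

GENOME="genome.fa"        # assumed reference genome FASTA
READS="m54105_171201_020331.ccs-lima.5p--3p.fltnc.dedup.fasta"
PREFIX="lib1"             # assumed label for the first library

# 1. Spliced alignment with minimap2's preset for high-quality Iso-Seq reads.
run minimap2 -ax splice:hq -uf --secondary=no -t 8 \
    -o "${PREFIX}.sam" "$GENOME" "$READS"

# 2. Coordinate-sort the alignments, keeping SAM as the output format.
run samtools sort -O sam -o "${PREFIX}.sorted.sam" "${PREFIX}.sam"

# 3. Collapse reads into transcript models (no_cap assumes no 5' cap selection).
run python tama/tama_collapse.py -s "${PREFIX}.sorted.sam" -f "$GENOME" \
    -p "${PREFIX}_collapsed" -x no_cap

# 4. After repeating steps 1-3 for the second library, list both collapsed
#    BED files for TAMA Merge: file, cap flag, merge priority, source name.
printf '%s\t%s\t%s\t%s\n' \
    "lib1_collapsed.bed" no_cap "1,1,1" lib1 \
    "lib2_collapsed.bed" no_cap "1,1,1" lib2 > tama_merge_filelist.txt

run python tama/tama_merge.py -f tama_merge_filelist.txt -p merged
```

In this layout the two libraries stay separate until the TAMA Merge step. TAMA is documented as needing Python 2.7, so `python` may have to point at a matching interpreter.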

It would help if I could get more info on the experimental setup if you are comfortable with sharing that.

Cheers, Richard