EI-CoreBioinformatics / reat

Robust Eukaryotic Annotation Toolkit
https://reat.readthedocs.io/en/latest/
MIT License
17 stars 3 forks source link

Error + Fix minimap issue mapping to large genomes #30

Closed swarbred closed 2 years ago

swarbred commented 2 years ago

We are running REAT (transcriptome, homology and prediction) on a wheat assembly ~15Gb, I've previously run REAT transcriptome with illumina data but this is the first run with long reads

error triggered at call-Minimap2Long (issue is actually the earlier step call-Minimap2Index)

Error

++ basename -- /ei/.project-scratch/5/5fc1cb64-b88e-47aa-8d79-0f02f235823d/CB-REAT_testing_Triticum_aestivum_julius/Analysis/reat-dev-issue25/Transcriptome/cromwell-executions/ei_annotation/50fbd28a-8f5b-4455-89fc-a0f8c556b788/call-wf_align/wf_align/96766735-6859-426e-8d87-edfbbb161910/call-HQ_align/wf_align_long/a1e163b1-fdb2-4238-b809-9479865f1a3a/call-minimap2/shard-1/wf_mm2/26037c1f-d7f1-415a-be94-e455f29d6603/call-Minimap2Long/shard-0/inputs/-1704965781/JUL_V.PacBio.fasta
+ filename=JUL_V.PacBio.fasta
+ extension=fasta
+ in_pipe='cat /ei/.project-scratch/5/5fc1cb64-b88e-47aa-8d79-0f02f235823d/CB-REAT_testing_Triticum_aestivum_julius/Analysis/reat-dev-issue25/Transcriptome/cromwell-executions/ei_annotation/50fbd28a-8f5b-4455-89fc-a0f8c556b788/call-wf_align/wf_align/96766735-6859-426e-8d87-edfbbb161910/call-HQ_align/wf_align_long/a1e163b1-fdb2-4238-b809-9479865f1a3a/call-minimap2/shard-1/wf_mm2/26037c1f-d7f1-415a-be94-e455f29d6603/call-Minimap2Long/shard-0/inputs/-1704965781/JUL_V.PacBio.fasta'
+ '[' fasta == bam ']'
+ '[' fasta == gz ']'
+ strand_opt=-ub
+ '[' fr-secondstrand == fr-secondstrand ']'
+ strand_opt=-uf
+ mkdir alignments
+ cd alignments
+ cat /ei/.project-scratch/5/5fc1cb64-b88e-47aa-8d79-0f02f235823d/CB-REAT_testing_Triticum_aestivum_julius/Analysis/reat-dev-issue25/Transcriptome/cromwell-executions/ei_annotation/50fbd28a-8f5b-4455-89fc-a0f8c556b788/call-wf_align/wf_align/96766735-6859-426e-8d87-edfbbb161910/call-HQ_align/wf_align_long/a1e163b1-fdb2-4238-b809-9479865f1a3a/call-minimap2/shard-1/wf_mm2/26037c1f-d7f1-415a-be94-e455f29d6603/call-Minimap2Long/shard-0/inputs/-1704965781/JUL_V.PacBio.fasta
+ minimap2 -ax splice:hq --junc-bed /ei/.project-scratch/5/5fc1cb64-b88e-47aa-8d79-0f02f235823d/CB-REAT_testing_Triticum_aestivum_julius/Analysis/reat-dev-issue25/Transcriptome/cromwell-executions/ei_annotation/50fbd28a-8f5b-4455-89fc-a0f8c556b788/call-wf_align/wf_align/96766735-6859-426e-8d87-edfbbb161910/call-HQ_align/wf_align_long/a1e163b1-fdb2-4238-b809-9479865f1a3a/call-minimap2/shard-1/wf_mm2/26037c1f-d7f1-415a-be94-e455f29d6603/call-Minimap2Long/shard-0/inputs/1072613602/combined_junctions.bed --cs=long -G 50000 -uf -t 30 -L --eqx -2 --secondary=no /ei/.project-scratch/5/5fc1cb64-b88e-47aa-8d79-0f02f235823d/CB-REAT_testing_Triticum_aestivum_julius/Analysis/reat-dev-issue25/Transcriptome/cromwell-executions/ei_annotation/50fbd28a-8f5b-4455-89fc-a0f8c556b788/call-wf_align/wf_align/96766735-6859-426e-8d87-edfbbb161910/call-HQ_align/wf_align_long/a1e163b1-fdb2-4238-b809-9479865f1a3a/call-minimap2/shard-1/wf_mm2/26037c1f-d7f1-415a-be94-e455f29d6603/call-Minimap2Long/shard-0/inputs/359518636/reference.san.fasta.mmi -
+ samtools view -F 4 -F 0x900 -bS -
+ samtools sort -@ 15 -m 1G -T minimap2.sort -o minimap2.JUL_V_isoseq_LR.JUL_V.PacBio.bam -
[WARNING]ESC[1;31m For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.ESC[0m
[M::main::69.074*0.51] loaded/built the index for 6 target sequence(s)
[M::mm_mapopt_update::74.165*0.54] mid_occ = 3065
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 6
[M::mm_idx_stat::76.679*0.56] distinct minimizers: 151822278 (42.13% are singletons); average occurrences: 8.922; average spacing: 2.969; total length: 4022006890
[E::sam_parse1] no SQ lines present in the header
[W::sam_read1_sam] Parse error at line 2
samtools view: error reading file "-"

The issue relates to the [WARNING]ESC[1;31m For a multi-part index

As the genome is greater than 4G a mulit-part index is generated

however there would be issues using --split-prefix as suggested as some of the SAM tags are not generated [ERROR]ESC[1;31m --cs or --MD doesn't work with --split-prefixESC[0m

I think we can resolve this by increasing -I at the index creation stage (the default is 4G), I tested with -I 24G for wheat and this completed used ~87G memory so not crazy high.

I suggest -I 128G for minimap in call-Minimap2Index, which will mean we use a uni index for anything we are likely to tackle (It could be put even higher I guess as the pipeline wont work with the multi index anyway)

minimap2 -ax splice:hq \
        -t 8 \
        -I 128G \
        -d reference.san.fasta.mmi

for reference discussion at https://github.com/fenderglass/Flye/issues/318 https://github.com/lh3/minimap2/issues/301