Closed alelim-bio closed 5 years ago
Hi @AsclepiusDoc, many thanks for reporting this. I will try to replicate and solve it ASAP.
Dear @AsclepiusDoc , I was able to reproduce the bug. I will fix it today. However, the bug was triggered because you provided Mikado serialise with the wrong input file - ie ORFs called on the input transcript assemblies for Mikado prepare (StringTie, if I am not mistaken) rather than the output of Mikado prepare itself.
To avoid this, I would recommend using the pipeline manager we developed - Daijin, included in Mikado - to perform the analysis on cotton. Specifically, Daijin can be used to run all the mikado steps.
In order:
1- run mikado configure, as you did before, but adding the --daijin
switch.
a- if you want to use TransDecoder
instead of Prodigal
, please add --use-transdecoder
to the command line
b- if you want to use BLAST+
instead of DIAMOND
, please add --use-blast
to the command line
2- run daijin to perform the Mikado pipeline:
daijin mikado --jobs {maximum number of jobs} --cores {# of cores} --threads {# of cores} -nd --nolock configuration.yaml
This will ensure that the steps are executed correctly. Further information can be found on the online documentation.
Thank you for reporting the bug!
PS:
I was wondering if it was possible to learn what was done to generate the .gff3 file from the Trinity transcripts using GMAP. If my understanding is correct usually you get a standard alignment file in (.sam/.bam) format?
gmap
can return a GFF3 output if called with the -f 2
switch.
Dear @AsclepiusDoc , now Mikado will die informatively when a situation like yours arises; this should make it easier to spot a problem in the pipeline's usage.
The rationale is that if Mikado had not died during serialisation, the data present in the database would have been useless - Mikado would have had no way to link the ORFs with the transcripts present in the GTF (as their names change due to the juxtaposition of a label).
Hello @lucventurini ,
Thank you so much for your help! That did the trick and I was able to get Mikado outputs. Additionally, I followed your suggestion and was able to get a Trinity output for Mikado however, I seem to have ran into another parsing issue with the Trinity .gff3 file. I have attached the error here if that is okay?
Error:
2019-08-07 13:05:09,936 - prepare - prepare.py:496 - ERROR - prepare - MainProcess - tr_TRINITY_GG_1628_c45_g1_i1.mrna1 is present multiple times in /pylon5/mc5fr6p/alelim/ACTIVE_ANALYSIS/cotton/transcript_assembly/SA_1766/genome_guided_trinity/gmap_alignment/cleaned_trinity_gmap.gff3(label: tr). This breaks the input parsing.
Please ensure that the file is properly formatted, with unique IDs.
Traceback (most recent call last):
File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/preparation/prepare.py", line 466, in prepare
max_intron=args.json_conf["prepare"]["max_intron_length"],)
File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/preparation/prepare.py", line 373, in load_exon_lines
_load_exon_lines_single_thread(args, shelve_names, logger, min_length, strip_cds, max_intron)
File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/preparation/prepare.py", line 270, in _load_exon_lines_single_thread
strand_specific=strand_specific or is_reference)
File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/preparation/annotation_parser.py", line 438, in load_from_gff
__raise_invalid(tid, gff_handle.name, label)
File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/preparation/annotation_parser.py", line 169, in __raise_invalid
"(label: {0})".format(label) if label != '' else ""))
Mikado.exceptions.InvalidAssembly: tr_TRINITY_GG_1628_c45_g1_i1.mrna1 is present multiple times in /pylon5/mc5fr6p/alelim/ACTIVE_ANALYSIS/cotton/transcript_assembly/SA_1766/genome_guided_trinity/gmap_alignment/cleaned_trinity_gmap.gff3(label: tr). This breaks the input parsing.
Please ensure that the file is properly formatted, with unique IDs.
Additionally, I grepped the string that was reported to be non-unique and got the following results:
10000001 SA1766 mRNA 38780180 40626956 . - . ID=TRINITY_GG_1628_c45_g1_i1.mrna1;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.path1;coverage=100.0;identity=99.6;matches=232;mismatches=1;indels=0;unknowns=0
10000001 SA1766 exon 40626876 40626956 100 - . ID=TRINITY_GG_1628_c45_g1_i1.mrna1.exon1;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 81 1 .
10000001 SA1766 exon 38785950 38785957 100 - . ID=TRINITY_GG_1628_c45_g1_i1.mrna1.exon2;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 97 90 .
10000001 SA1766 exon 38781015 38781024 100 - . ID=TRINITY_GG_1628_c45_g1_i1.mrna1.exon3;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 107 98 .
1 SA1766 exon 38780180 38780313 99 - . ID=TRINITY_GG_1628_c45_g1_i1.mrna1.exon4;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 243 110 .
10000001 SA1766 CDS 40626876 40626913 100 - 0 ID=TRINITY_GG_1628_c45_g1_i1.mrna1.cds1;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 81 44 .
10000001 SA1766 CDS 38785950 38785957 100 - 1 ID=TRINITY_GG_1628_c45_g1_i1.mrna1.cds2;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 97 90 .
10000001 SA1766 CDS 38781015 38781024 100 - 0 ID=TRINITY_GG_1628_c45_g1_i1.mrna1.cds3;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 107 98 .
10000001 SA1766 CDS 38780296 38780313 100 - 0 ID=TRINITY_GG_1628_c45_g1_i1.mrna1.cds4;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 127 110 .
Furthermore, I will look into the Daijin pipeline as you have recommended. Once again thank you for all your help!
Kind Regards,
Alex Lim
Dear @AsclepiusDoc, the problem is that one of the lines is on chromosome 1 (rather than 10000001). That is clearly a mistake on GMAP part, as the culprit exon is actually missing from the correct chromosome (it's the starting exon of the model).
Mikado presumes that models should belong to the same scaffold /chromosome, so this is flagged as a clear a mistake. I do not know why it happened; it looks like a bug in GMAP. I would try a different version of the program to see whether the bug represents itself again.
Glad to hear that my suggestions solved the problem for you!
Hello @lucventurini ,
I have moved onto mikado serialise and have ran into another error. I was wondering if I could get your help?
Below is the error that I am currently getting.
Here is the command I am currently using:
mikado serialise --json-conf configuration.yaml --xml mikado.blast.xml.gz --orfs ../stringtie/longest_orfs.pep.transdecoder.bed
Additionally, I have attached a toy dataset of what I am working with.
Kind Regards,
Alex Lim
Question unrelated to the error if you have some time:
I was wondering if it was possible to learn what was done to generate the .gff3 file from the Trinity transcripts using GMAP. If my understanding is correct usually you get a standard alignment file in (.sam/.bam) format?
toy_bed.zip