EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data * and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0
97 stars 18 forks source link

mikado serialise: sqlalchemy.exc.InvalidRequestError: No transaction is begun. #206

Closed alelim-bio closed 5 years ago

alelim-bio commented 5 years ago

Hello @lucventurini ,

I have moved onto mikado serialise and have ran into another error. I was wondering if I could get your help?

Below is the error that I am currently getting.

2019-08-05 18:12:49,628 - main - __init__.py:123 - ERROR - main - MainProcess - Mikado crashed, cause:
2019-08-05 18:12:49,629 - main - __init__.py:124 - ERROR - main - MainProcess - No transaction is begun.
Traceback (most recent call last):
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/__init__.py", line 109, in main
    args.func(args)
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/subprograms/serialise.py", line 336, in serialise
    load_orfs(args, logger)
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/subprograms/serialise.py", line 145, in load_orfs
    serializer()
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/serializers/orf.py", line 317, in __call__
    self.serialize()
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/serializers/orf.py", line 291, in serialize
    self.session.commit()
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 1024, in commit
    raise sa_exc.InvalidRequestError("No transaction is begun.")
sqlalchemy.exc.InvalidRequestError: No transaction is begun.

Here is the command I am currently using:

mikado serialise --json-conf configuration.yaml --xml mikado.blast.xml.gz --orfs ../stringtie/longest_orfs.pep.transdecoder.bed

Additionally, I have attached a toy dataset of what I am working with.

Kind Regards,

Alex Lim

Question unrelated to the error if you have some time:

I was wondering if it was possible to learn what was done to generate the .gff3 file from the Trinity transcripts using GMAP. If my understanding is correct usually you get a standard alignment file in (.sam/.bam) format?

toy_bed.zip

lucventurini commented 5 years ago

Hi @AsclepiusDoc, many thanks for reporting this. I will try to replicate and solve it ASAP.

lucventurini commented 5 years ago

Dear @AsclepiusDoc , I was able to reproduce the bug. I will fix it today. However, the bug was triggered because you provided Mikado serialise with the wrong input file - ie ORFs called on the input transcript assemblies for Mikado prepare (StringTie, if I am not mistaken) rather than the output of Mikado prepare itself.

To avoid this, I would recommend using the pipeline manager we developed - Daijin, included in Mikado - to perform the analysis on cotton. Specifically, Daijin can be used to run all the mikado steps.

In order: 1- run mikado configure, as you did before, but adding the --daijin switch. a- if you want to use TransDecoder instead of Prodigal, please add --use-transdecoder to the command line b- if you want to use BLAST+ instead of DIAMOND, please add --use-blast to the command line 2- run daijin to perform the Mikado pipeline:

daijin mikado --jobs {maximum number of jobs} --cores {# of cores} --threads {# of cores} -nd --nolock configuration.yaml

This will ensure that the steps are executed correctly. Further information can be found on the online documentation.

Thank you for reporting the bug!

lucventurini commented 5 years ago

PS:

I was wondering if it was possible to learn what was done to generate the .gff3 file from the Trinity transcripts using GMAP. If my understanding is correct usually you get a standard alignment file in (.sam/.bam) format?

gmap can return a GFF3 output if called with the -f 2 switch.

lucventurini commented 5 years ago

Dear @AsclepiusDoc , now Mikado will die informatively when a situation like yours arises; this should make it easier to spot a problem in the pipeline's usage.

The rationale is that if Mikado had not died during serialisation, the data present in the database would have been useless - Mikado would have had no way to link the ORFs with the transcripts present in the GTF (as their names change due to the juxtaposition of a label).

alelim-bio commented 5 years ago

Hello @lucventurini ,

Thank you so much for your help! That did the trick and I was able to get Mikado outputs. Additionally, I followed your suggestion and was able to get a Trinity output for Mikado however, I seem to have ran into another parsing issue with the Trinity .gff3 file. I have attached the error here if that is okay?

Error:

2019-08-07 13:05:09,936 - prepare - prepare.py:496 - ERROR - prepare - MainProcess - tr_TRINITY_GG_1628_c45_g1_i1.mrna1 is present multiple times in /pylon5/mc5fr6p/alelim/ACTIVE_ANALYSIS/cotton/transcript_assembly/SA_1766/genome_guided_trinity/gmap_alignment/cleaned_trinity_gmap.gff3(label: tr). This breaks the input parsing.
        Please ensure that the file is properly formatted, with unique IDs.
Traceback (most recent call last):
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/preparation/prepare.py", line 466, in prepare
    max_intron=args.json_conf["prepare"]["max_intron_length"],)
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/preparation/prepare.py", line 373, in load_exon_lines
    _load_exon_lines_single_thread(args, shelve_names, logger, min_length, strip_cds, max_intron)
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/preparation/prepare.py", line 270, in _load_exon_lines_single_thread
    strand_specific=strand_specific or is_reference)
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/preparation/annotation_parser.py", line 438, in load_from_gff
    __raise_invalid(tid, gff_handle.name, label)
  File "/pylon5/mc5fr6p/alelim/CONDA/anaconda3/envs/bio/lib/python3.6/site-packages/Mikado/preparation/annotation_parser.py", line 169, in __raise_invalid
    "(label: {0})".format(label) if label != '' else ""))
Mikado.exceptions.InvalidAssembly: tr_TRINITY_GG_1628_c45_g1_i1.mrna1 is present multiple times in /pylon5/mc5fr6p/alelim/ACTIVE_ANALYSIS/cotton/transcript_assembly/SA_1766/genome_guided_trinity/gmap_alignment/cleaned_trinity_gmap.gff3(label: tr). This breaks the input parsing.
        Please ensure that the file is properly formatted, with unique IDs.

Additionally, I grepped the string that was reported to be non-unique and got the following results:

10000001        SA1766  mRNA    38780180        40626956        .       -       .       ID=TRINITY_GG_1628_c45_g1_i1.mrna1;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.path1;coverage=100.0;identity=99.6;matches=232;mismatches=1;indels=0;unknowns=0
10000001        SA1766  exon    40626876        40626956        100     -       .       ID=TRINITY_GG_1628_c45_g1_i1.mrna1.exon1;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 81 1 .
10000001        SA1766  exon    38785950        38785957        100     -       .       ID=TRINITY_GG_1628_c45_g1_i1.mrna1.exon2;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 97 90 .
10000001        SA1766  exon    38781015        38781024        100     -       .       ID=TRINITY_GG_1628_c45_g1_i1.mrna1.exon3;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 107 98 .
1       SA1766  exon    38780180        38780313        99      -       .       ID=TRINITY_GG_1628_c45_g1_i1.mrna1.exon4;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 243 110 .
10000001        SA1766  CDS     40626876        40626913        100     -       0       ID=TRINITY_GG_1628_c45_g1_i1.mrna1.cds1;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 81 44 .
10000001        SA1766  CDS     38785950        38785957        100     -       1       ID=TRINITY_GG_1628_c45_g1_i1.mrna1.cds2;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 97 90 .
10000001        SA1766  CDS     38781015        38781024        100     -       0       ID=TRINITY_GG_1628_c45_g1_i1.mrna1.cds3;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 107 98 .
10000001        SA1766  CDS     38780296        38780313        100     -       0       ID=TRINITY_GG_1628_c45_g1_i1.mrna1.cds4;Name=TRINITY_GG_1628_c45_g1_i1;Parent=TRINITY_GG_1628_c45_g1_i1.mrna1;Target=TRINITY_GG_1628_c45_g1_i1 127 110 .

Furthermore, I will look into the Daijin pipeline as you have recommended. Once again thank you for all your help!

Kind Regards,

Alex Lim

lucventurini commented 5 years ago

Dear @AsclepiusDoc, the problem is that one of the lines is on chromosome 1 (rather than 10000001). That is clearly a mistake on GMAP part, as the culprit exon is actually missing from the correct chromosome (it's the starting exon of the model).

Mikado presumes that models should belong to the same scaffold /chromosome, so this is flagged as a clear a mistake. I do not know why it happened; it looks like a bug in GMAP. I would try a different version of the program to see whether the bug represents itself again.

Glad to hear that my suggestions solved the problem for you!