EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

IndexError in Mikado Serialise #376

Status: Closed (lukesarre closed this issue 3 years ago)

lukesarre commented 3 years ago

Hi all, I'm running into a problem using Mikado serialise: the BLAST serialiser is crashing. I'm running Mikado v2.0.2.

I created an input file (Amoebidium_mikadoInput_list), which references previous StringTie and GMAP alignments:

/data3/scratch/btx832/Annotation/Mikado/mikado_input/EBI_transcripts.gtf    stringtie1  False   0.5
/data3/scratch/btx832/Annotation/Mikado/mikado_input/EBIVsGenome_gmap.gff3  gmap1   False   0.5

I then ran these commands (each alone, submitted to a cluster):

mikado configure -t 6 --list Amoebidium_mikadoInput_list --reference Amoebidium_softmasked_v15_01_2021.fa --mode permissive --scoring human.yaml --copy-scoring vertebrates.yaml --junctions ./mikado_input/portcullis_filtered.pass.junctions.bed -bt /data/SBCS-ademendoza/02-lukesarre/db/Uniprot_SwissProt_020221 configuration.yaml

mikado prepare --json-conf configuration.yaml

diamond blastx --query mikado_prepared.fasta --max-target-seqs 5 --sensitive --index-chunks 1 --threads 10 --db /data/SBCS-ademendoza/02-lukesarre/db/Uniprot_SwissProt_020221.dmnd --evalue 1e-6 --outfmt 5 --out mikado.diamond.xml

TransDecoder.LongOrfs -t mikado_prepared.fasta
TransDecoder.Predict -t mikado_prepared.fasta

mikado serialise --procs 1 --json-conf configuration.yaml --xml mikado.diamond.xml --orfs mikado_prepared.fasta.transdecoder.bed --blast_targets /data/SBCS-ademendoza/02-lukesarre/db/Uniprot_SwissProt_020221.fasta --transcripts mikado_prepared.fasta --junctions ./mikado_input/portcullis_filtered.pass.junctions.bed

The error I am seeing in the stdout looks like this:

Variable OMP_NUM_THREADS has been set to 1
Mikado crashed, cause:
index 429 is out of bounds for axis 1 with size 429, off by one: False; min, max: 16, 429; frame: -1
Traceback (most recent call last):
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/serializers/blast_serializer/hit.py", line 340, in prepare_hit
    query_array[1, posit - off_by_one] = 1
IndexError: index 429 is out of bounds for axis 1 with size 429

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/__main__.py", line 68, in main
    args.func(args)
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/subprograms/serialise.py", line 377, in serialise
    load_blast(args, logger)
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/subprograms/serialise.py", line 125, in load_blast
    part_launcher(filenames)
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/subprograms/serialise.py", line 53, in xml_launcher
    xml_serializer()
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/serializers/blast_serializer/blast_serialiser.py", line 360, in __call__
    self.serialize()
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/serializers/blast_serializer/blast_serialiser.py", line 342, in serialize
    self.__serialise_xmls()
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/serializers/blast_serializer/blast_serialiser.py", line 351, in __serialise_xmls
    _serialise_xmls(self)
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/serializers/blast_serializer/xml_serialiser.py", line 124, in _serialise_xmls
    hits, hsps, cache = objectify_record(
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/serializers/blast_serializer/xml_serialiser.py", line 239, in objectify_record
    hit, hit_hsps = prepare_hit(alignment, current_query,
  File "/data/home/btx832/src/mymikado/venv/lib/python3.8/site-packages/Mikado-2.0.2-py3.8-linux-x86_64.egg/Mikado/serializers/blast_serializer/hit.py", line 342, in prepare_hit
    raise IndexError("{}, off by one: {}; min, max: {}, {}; frame: {}".format(
IndexError: index 429 is out of bounds for axis 1 with size 429, off by one: False; min, max: 16, 429; frame: -1
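
To see why a coordinate like this crashes, the sketch below (a simplified stand-in, not Mikado's actual `prepare_hit` code; `mark_positions` is a hypothetical helper) reproduces the failure: BLAST/DIAMOND alignment coordinates are 1-based, so a hit ending exactly at the query length overflows a 0-based positional array unless an off-by-one correction is applied.

```python
# Simplified stand-in (hypothetical, NOT Mikado's real code) for the
# positional bookkeeping in prepare_hit: aligned query positions are
# marked in row 1 of a 2 x query_length matrix. With 1-based input
# coordinates and no off-by-one correction, a position equal to the
# query length (429 here) falls outside the 0-based array.

def mark_positions(query_length, positions, off_by_one):
    off = 1 if off_by_one else 0
    query_array = [[0] * query_length for _ in range(2)]
    for posit in positions:
        query_array[1][posit - off] = 1
    return query_array

# min and max positions from the error message above (16 and 429),
# on a query of length 429:
try:
    mark_positions(429, [16, 429], off_by_one=False)
except IndexError:
    print("IndexError: position 429 out of range for size 429")

fixed = mark_positions(429, [16, 429], off_by_one=True)
print(fixed[1][428])  # → 1
```

In this thread, the crash disappeared after upgrading DIAMOND, suggesting the out-of-range coordinate came from the aligner's output rather than from Mikado itself.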

and the serialise.log looks like this:

2021-02-16 11:55:17,405 - serialiser - serialise.py:301 - INFO - setup - MainProcess - Random seed: 4134769763
2021-02-16 11:55:17,426 - serialiser - serialise.py:330 - INFO - setup - MainProcess - Using a sqlite database (location: /data3/scratch/btx832/Annotation/Mikado/mikado.db)
2021-02-16 11:55:17,427 - serialiser - serialise.py:334 - INFO - setup - MainProcess - Requested 1 threads, forcing single thread: False
2021-02-16 11:55:17,427 - serialiser - serialise.py:142 - INFO - load_orfs - MainProcess - Starting to load ORF data
2021-02-16 11:55:35,301 - serialiser - orf.py:365 - INFO - __serialize_single_thread - MainProcess - Finished loading 70034 ORFs into the database
2021-02-16 11:55:36,372 - serialiser - serialise.py:154 - INFO - load_orfs - MainProcess - Finished loading ORF data
2021-02-16 11:55:36,414 - serialiser - serialise.py:108 - INFO - load_blast - MainProcess - Starting to load BLAST data
2021-02-16 11:55:36,416 - serialiser - blast_serialiser.py:81 - INFO - __init__ - MainProcess - Number of dedicated workers: 1
2021-02-16 11:55:44,740 - serialiser - blast_serialiser.py:244 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2021-02-16 11:56:00,625 - serialiser - blast_serialiser.py:279 - INFO - __serialize_targets - MainProcess - Loaded 563972 objects into the "target" table
2021-02-16 11:56:03,356 - serialiser - blast_serialiser.py:173 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2021-02-16 11:56:03,447 - serialiser - blast_serialiser.py:221 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table
2021-02-16 11:56:04,334 - serialiser - blast_serialiser.py:228 - INFO - __serialize_queries - MainProcess - 193941 in queries

Any help would be much appreciated! I have the input files available, but I'm unsure how to share them, as they come to ~160 MB in a zip file.

Thank you, Luke

lucventurini commented 3 years ago

Hi @lukesarre ,

May I ask which version of DIAMOND you are using? This might be related to a bug present in DIAMOND versions 0.9.26 through 0.9.30 (see here).
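
As a quick check, the version string printed by `diamond version` can be compared against that range; `in_affected_range` below is a hypothetical helper, not part of Mikado or DIAMOND.

```python
# Hypothetical helper: compare a DIAMOND version string (as printed by
# `diamond version`) against the 0.9.26-0.9.30 range mentioned above.

def in_affected_range(version_string):
    parts = tuple(int(p) for p in version_string.strip().split("."))
    return (0, 9, 26) <= parts <= (0, 9, 30)

print(in_affected_range("0.9.22"))  # → False
print(in_affected_range("0.9.28"))  # → True
```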

lukesarre commented 3 years ago

Hi @lucventurini, Thank you for your fast response! We are using v0.9.22. All the best, Luke

swarbred commented 3 years ago

@lucventurini

Do we support --outfmt 5 (i.e. XML output from DIAMOND)?

the preferred output format is

--outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop

lucventurini commented 3 years ago

@swarbred

Do we support --outfmt 5 (i.e. XML output from DIAMOND)?

Yes, it is still supported.

@lukesarre

Thank you for your fast response! We are using v0.9.22

Probably that is the origin of the problem. Would it be possible for you to repeat the run using a more recent version of DIAMOND?

Alternatively, as @swarbred suggested, the best option might be to obtain the output in tabular format, with the options he indicated:

--outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop
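
For illustration only (this is not Mikado's tabular reader, and the example row is made up), a line of that custom tabular output maps onto its fields like this:

```python
import csv
import io

# Field names in the order given by the --outfmt 6 specification above.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore",
          "ppos", "btop"]

# Made-up example row, for illustration only.
example_row = "\t".join([
    "transcript_1", "sp|P12345|EXAMPLE", "87.5", "120", "15", "0",
    "1", "360", "10", "129", "1e-50", "230.5", "92.0", "120",
])

record = dict(zip(FIELDS,
                  next(csv.reader(io.StringIO(example_row), delimiter="\t"))))
print(record["qseqid"], record["evalue"])  # → transcript_1 1e-50
```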

lukesarre commented 3 years ago

Hi @lucventurini , Yep! Installing DIAMOND v2.0.7 seemed to fix the issue. Thank you very much for your help! All the best, Luke

lukesarre commented 3 years ago

Hi @lucventurini ,

I don't know if this is a related issue, but when I continue with the pipeline I again run into blast-related issues. Looking at the mikado.subloci.metrics.tsv file, the blast step appears to have been successful, as the blast-related values for many (although not most) of the transcripts are not '0'.

Curiously, however, in the mikado.loci.metrics.tsv file all of the blast-related values are '0', and searching for individual 'aliases' shows that these tids did have blast scores at the subloci stage. Do you know why the blast information might not be carried over?

I will try using a tabular output of DIAMOND, although let me know if you have another approach I should try.

Thank you, Luke

lucventurini commented 3 years ago

Hi @lukesarre Thank you for pointing this out, we will investigate! I currently do not know why that would be the case.

Best, Luca

lucventurini commented 3 years ago

@lukesarre @ljyanesm

Confirmed in our test dataset. That's good: we can verify the fix and later add tests to prevent regressions. Thanks again @lukesarre for pointing this out.

lukesarre commented 3 years ago

Awesome, I appreciate the work you guys do on this software! To confirm, I run into the same issue when using the recommended .tsv diamond output. All the best, Luke

lucventurini commented 3 years ago

Dear @lukesarre,

We have fixed the issue in master. We will be releasing a new patch release in the next weeks; in the meantime, you can download directly from master.

I want to emphasize that the results from v2.0.2 are correct; only the reporting of the metrics/scores is not.