EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

serialise fails to load blast dbase .. can't find entries ... dictionary value error issue #392

Closed adamfreedman closed 3 years ago

adamfreedman commented 3 years ago

running the latest mikado using similar cmds to what i used in 2020 with success ...

the cmd: mikado serialise --json-conf configuration.yaml --xml blastx/mikado.blastx.xml.cocnat_2021.03.23.xml.gz --orfs transdecoder/mikado_prepared.fasta.transdecoder.bed --blast_targets xtrop_xlaevis_nparkeri_lcatesbeianus_protein.faa

stderr:

Mikado crashed, cause: ref|XP_018411542.1| not found (Accession: {'_id': None, '_id_alt': [], '_query_id': None, '_description': 'PREDICTED: ras association domain-containing protein 7 [Nanorana parkeri]', '_description_alt': [], '_query_description': '', 'attributes': {}, 'dbxrefs': [], '_items': [HSP(hit_id='ref|XP_018411542.1|', query_id='scallop_TU12746', 1 fragments)], 'blast_id': 'ref|XP_018411542.1|', 'accession': 'XP_018411542', 'seq_len': 431})
Traceback (most recent call last):
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/main.py", line 68, in main
    args.func(args)
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 378, in serialise
    load_blast(args, logger)
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 125, in load_blast
    part_launcher(filenames)
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 53, in xml_launcher
    xml_serializer()
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 360, in __call__
    self.serialize()
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 342, in serialize
    self.serialise_xmls()
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 351, in serialise_xmls
    _serialise_xmls(self)
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_serialiser.py", line 124, in _serialise_xmls
    max_target_seqs=self._max_target_seqs, logger=self.logger, off_by_one=off_by_one)
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_serialiser.py", line 224, in objectify_record
    current_target, cache["target"] = _get_target_for_blast(alignment, cache["target"])
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_utils.py", line 89, in _get_target_for_blast
    raise ValueError("{} not found (Accession: {})".format(alignment.id, alignment.__dict__))
ValueError: ref|XP_018411542.1| not found (Accession: {'_id': None, '_id_alt': [], '_query_id': None, '_description': 'PREDICTED: ras association domain-containing protein 7 [Nanorana parkeri]', '_description_alt': [], '_query_description': '', 'attributes': {}, 'dbxrefs': [], '_items': [HSP(hit_id='ref|XP_018411542.1|', query_id='scallop_TU12746', 1 fragments)], 'blast_id': 'ref|XP_018411542.1|', 'accession': 'XP_018411542', 'seq_len': 431})

wyim-pgl commented 3 years ago

Hi! It may be helpful for you to explain how you ran BLAST.

adamfreedman commented 3 years ago

I split mikado_prepared.fasta into chunks to run as a job array with slurm, with each cmd as follows:

blastx -max_target_seqs 5 -num_threads 10 -query mikado_prepared."${SLURM_ARRAY_TASK_ID}".fasta -outfmt 5 -db ../xtrop_xlaevis_nparkeri_lcatesbeianus_protein.faa -evalue 0.000001 2> blast.${SLURM_ARRAY_TASK_ID}.log | sed '/^$/d' | gzip -c - > mikado.${SLURM_ARRAY_TASK_ID}.blast.xml.gz

"${SLURM_ARRAY_TASK_ID}" is just the number of the subfile, e.g. mikado_prepared.1.fasta, mikado_prepared.2.fasta, etc.

I then zcat all the array outputs into one blast.xml file, then gzip that file

fwiw, this is exactly how i've run blastx previously for use with mikado
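For readers who want to reproduce the chunking step described above, here is a minimal sketch in Python. It is not the script used in this thread (that one was not posted); the function names `read_fasta` and `split_fasta` are hypothetical, and the output naming only mirrors the `mikado_prepared.1.fasta`, `mikado_prepared.2.fasta` pattern mentioned above. Records are dealt out round-robin so the chunks end up with roughly equal numbers of sequences.

```python
from itertools import cycle
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence_lines) for each record in a FASTA file."""
    header, lines = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line, []
            elif line:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(path, n_chunks, prefix="mikado_prepared"):
    """Write records round-robin into <prefix>.<i>.fasta for i = 1..n_chunks.

    Files are created next to the input file; the list of paths is returned.
    """
    out_dir = Path(path).parent
    names = [out_dir / f"{prefix}.{i}.fasta" for i in range(1, n_chunks + 1)]
    handles = [open(name, "w") for name in names]
    try:
        # zip() stops when the records run out; cycle() keeps rotating the outputs
        for (header, lines), handle in zip(read_fasta(path), cycle(handles)):
            handle.write("\n".join([header, *lines]) + "\n")
    finally:
        for handle in handles:
            handle.close()
    return names
```

Each chunk index then corresponds to one value of `SLURM_ARRAY_TASK_ID` in the blastx command above.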

-Adam

Adam H. Freedman, PhD Data Scientist Faculty of Arts & Sciences Informatics Group Harvard University 38 Oxford St Cambridge, MA 02138 phone: +001 310 415 7145



wyim-pgl commented 3 years ago

How about makeblastdb? it might need to have -parse_seqids. BTW, you can do cat *.gz >> output.gz instead of zcat and gzip
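The `cat *.gz` shortcut works because the gzip format allows several compressed members back to back in one file, and decompressors return their contents in order. A small Python sketch of the same idea follows; `concat_gzip` is a hypothetical helper name, not part of any tool mentioned here.

```python
import gzip
import shutil


def concat_gzip(parts, destination):
    """Append gzip files byte for byte, as `cat a.gz b.gz > destination` would."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def read_gzip(path):
    """Decompress a (possibly multi-member) gzip file and return its bytes."""
    with gzip.open(path, "rb") as handle:
        return handle.read()
```

Note that this only guarantees a valid gzip stream. Whether the decompressed content is valid for the downstream parser is a separate question: an XML parser, for instance, expects a single root element per document, so joining several complete XML outputs this way does not by itself give one well-formed XML file.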

adamfreedman commented 3 years ago

pretty sure I used -parse_seqids, but I only have the stdout log for makeblastdb. I grabbed the makeblastdb cmd from the mikado readthedocs page, which includes that switch.


lucventurini commented 3 years ago

Dear @adamfreedman

Thank you for reporting this, and thank you to @wyim-pgl for helping out!

I fear @ljyanesm and I might have introduced a bug in the parsing of the reference sequence in the latest release; I know we touched the relevant regular expression. Would you please be able to send us a minimal example here (e.g. some ten sequences from the BLAST database and ten from the mikado_prepared.fasta file that you know align to each other) so that we can test this?

As another note, @ljyanesm and I have recently moved Mikado away from using XML files as the default for BLAST, please see the documentation here: https://mikado.readthedocs.io/en/stable/Usage/Serialise/?highlight=tabular#blast-files

I am in the process of revising the documentation and I will make sure to update the tutorial if it is out of sync with this change.

It might very well be that the bug you encountered will affect the tabular format as well. Regardless, we would appreciate it if you could send us a test file so that we can diagnose and solve the issue as soon as possible.

Kind regards,

adamfreedman commented 3 years ago

here are fasta files of queries and targets for which the former hit the latter with blastx: testqueries.fasta.gz, testtargets.fasta.gz

lucventurini commented 3 years ago

Dear @adamfreedman

@ljyanesm and I identified the cause; it was indeed linked to the regular expression. Briefly, Mikado was malfunctioning when the -parse_seqids option had been used during database construction with NCBI BLAST+.

We have fixed the code and I am currently implementing the tests. We will be releasing a new version (2.2.3) later today UK time I hope.

Kind regards,
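To illustrate the kind of mismatch involved: in the error above the hit is reported as `ref|XP_018411542.1|` while the stored accession is `XP_018411542`. The sketch below is not Mikado's regular expression and not the fix that was committed; it is a hypothetical lookup (`candidate_ids`, `find_target` and the pattern are all assumptions made for illustration) that tries progressively simpler forms of a pipe-delimited identifier against a dictionary of known targets.

```python
import re

# Assumed identifier shape: "<tag>|<accession.version>|<optional name>"
_NCBI_ID = re.compile(r"^(?P<tag>[A-Za-z]+)\|(?P<acc>[^|\s]+)\|(?P<name>[^|\s]*)$")


def candidate_ids(blast_id):
    """Return lookup keys for a BLAST hit ID, most specific first."""
    keys = [blast_id]
    match = _NCBI_ID.match(blast_id)
    if match:
        acc = match.group("acc")
        keys.append(acc)                # accession with version
        keys.append(acc.split(".")[0])  # accession without version
    # dict.fromkeys drops duplicates while keeping the order
    return list(dict.fromkeys(keys))


def find_target(blast_id, targets):
    """Look up a hit in `targets`, trying each candidate key in turn."""
    tried = candidate_ids(blast_id)
    for key in tried:
        if key in targets:
            return targets[key]
    raise ValueError("{} not found (tried: {})".format(blast_id, ", ".join(tried)))


# Example with a toy table mapping target IDs to sequence lengths
targets = {"XP_018411542.1": 431}
print(find_target("ref|XP_018411542.1|", targets))  # 431
```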

lucventurini commented 3 years ago

Dear @adamfreedman

We have fixed this in 69e45a4. I am about to release to PyPI and Conda.

Kind regards,

adamfreedman commented 3 years ago

The fix may have added or exposed other bugs.

serialise threw an exception that suggested something wrong with the config file, so i re-ran configure. I assumed that the files created with the previous version of prepare would not create an issue. Upon running serialise again, I got:

Mikado crashed, cause: junk after document element: line 53663, column 0
Traceback (most recent call last):
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Mikado/main.py", line 68, in main
    args.func(args)
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 384, in serialise
    load_blast(mikado_configuration, logger)
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 159, in load_blast
    part_launcher(filenames)
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 87, in xml_launcher
    xml_serializer()
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 377, in __call__
    self.serialize()
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 349, in serialize
    self.serialise_xmls()
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 358, in serialise_xmls
    _serialise_xmls(self)
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_serialiser.py", line 111, in _serialise_xmls
    for record in opened:
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Mikado/parsers/blast_utils.py", line 103, in __next__
    return next(iter(self.parser))
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Bio/SearchIO/__init__.py", line 306, in parse
    yield from generator
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Bio/SearchIO/BlastIO/blast_xml.py", line 240, in __iter__
    yield from self._parse_qresult()
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/site-packages/Bio/SearchIO/BlastIO/blast_xml.py", line 289, in _parse_qresult
    for event, qresult_elem in self.xml_iter:
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/xml/etree/ElementTree.py", line 1222, in iterator
    yield from pullparser.read_events()
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/xml/etree/ElementTree.py", line 1297, in read_events
    raise event
  File "/n/home_rc/afreedman/.conda/envs/mikado2.2.3/lib/python3.7/xml/etree/ElementTree.py", line 1269, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: junk after document element: line 53663, column 0

-Adam


adamfreedman commented 3 years ago

and for what it's worth, this was done running on an updated install from conda.


lucventurini commented 3 years ago

Dear @adamfreedman ,

Thank you for the update. May I suggest inspecting the XML files passed to serialise though? I strongly suspect that one or more might be truncated.

I am asking this because the traceback indicates that the error was triggered in the BioPython code for parsing XML files, which itself was triggered by what seems an unexpected truncation of the document at line 53663.

Admittedly the Mikado code could handle this better and better inform the user of what has happened, and in which file. This is something we can try to improve on.

In case you indeed need to regenerate the BLAST files, I would like again to point out that the new Mikado versions can load data faster by using the tabular format rather than XML, with custom fields.

Many thanks for your patience and feedback.
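One way to find out which input file is at fault before running serialise is to parse each XML file on its own and report the first error. The sketch below uses only the Python standard library; `open_maybe_gzip` and `check_xml` are hypothetical helper names, not Mikado functions. ElementTree raises "junk after document element" when a tag or text follows the closing root tag, and a different ParseError when the document stops early, so the message also hints at what went wrong.

```python
import gzip
import xml.etree.ElementTree as ET


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed file in binary mode."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    # gzip streams start with the two bytes 0x1f 0x8b
    return gzip.open(path, "rb") if magic == b"\x1f\x8b" else open(path, "rb")


def check_xml(path):
    """Return None if `path` parses as one XML document, else an error string."""
    try:
        with open_maybe_gzip(path) as handle:
            for _event, elem in ET.iterparse(handle):
                elem.clear()  # discard parsed elements to limit memory use
    except (ET.ParseError, EOFError, OSError) as exc:
        # EOFError / OSError cover truncated or unreadable gzip files
        return f"{path}: {exc}"
    return None


def report(paths):
    """Print one line per problematic file and return the number of failures."""
    problems = [msg for msg in map(check_xml, paths) if msg is not None]
    for msg in problems:
        print(msg)
    return len(problems)
```

Running `report` over the per-chunk BLAST outputs, rather than over a single merged file, narrows the problem down to the chunk or the merge step.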

adamfreedman commented 3 years ago

yeah ... it looks like i did something wrong with the job array output concatenation. no record of what i did, but seems i was holding something wrong.

blast has already been done but in the future i'll just output in tabular format, per your suggestion.

thanks, Adam


lucventurini commented 3 years ago

Dear @adamfreedman

Thank you again for the update. I hope that this time Mikado will run more smoothly. Please let us know if you encounter any other issue.

Many thanks, Luca Venturini