EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

error during mikado config #297

Closed. Ruth-hals closed this issue 3 years ago.

Ruth-hals commented 4 years ago

Hi, I am not able to figure out why the filename of my gff3 file keeps causing an error:

    /data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/configuration/configurator.py:529: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      scoring = yaml.load(scoring_file)
    2020-03-10 11:53:30,753 - main - __init__.py:124 - ERROR - main - MainProcess - Mikado crashed, cause:
    2020-03-10 11:53:30,753 - main - __init__.py:125 - ERROR - main - MainProcess - Malformed inputs file. Error:
    Invalid file name: /data/leuven/329/vsc32990/mikadoinput/petrosino.gff3
    Traceback (most recent call last):
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/subprograms/configure.py", line 210, in create_config
        raise ValueError("Invalid file name: {}".format(filename))
    ValueError: Invalid file name: /data/leuven/329/vsc32990/mikadoinput/petrosino.gff3

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/__init__.py", line 110, in main
        args.func(args)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/subprograms/configure.py", line 222, in create_config
        raise ValueError("Malformed inputs file. Error:\n{}".format(exc))
    ValueError: Malformed inputs file. Error:
    Invalid file name: /data/leuven/329/vsc32990/mikadoinput/petrosino.gff3

If I remove this gff file from my configuration file I get the following error:

    /data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/configuration/configurator.py:529: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      scoring = yaml.load(scoring_file)
    2020-03-10 11:57:30,301 - main - __init__.py:124 - ERROR - main - MainProcess - Mikado crashed, cause:
    2020-03-10 11:57:30,301 - main - __init__.py:125 - ERROR - main - MainProcess - Malformed inputs file. Error:
    not enough values to unpack (expected 3, got 1)
    Traceback (most recent call last):
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/subprograms/configure.py", line 207, in create_config
        filename, label, stranded = _fields[:3]
    ValueError: not enough values to unpack (expected 3, got 1)

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/__init__.py", line 110, in main
        args.func(args)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/subprograms/configure.py", line 222, in create_config
        raise ValueError("Malformed inputs file. Error:\n{}".format(exc))
    ValueError: Malformed inputs file. Error:
    not enough values to unpack (expected 3, got 1)

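For reference, the second traceback hints at the format mikado configure expects here: each line of the input list needs at least three tab-separated fields (filename, label, stranded). A hypothetical example list, with made-up paths and labels:

    /path/to/assembly1.gtf	asm1	True
    /path/to/petrosino.gff3	petro	False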

Any help would be appreciated!

Ruth-hals commented 4 years ago

Hi, I solved this issue by creating a new gff file in the gene format instead of EST_match, and then Mikado was able to parse it.
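For anyone hitting the same problem, a minimal sketch of a gene-style GFF3 record (nine tab-separated columns; coordinates, sources, and IDs here are made up for illustration), as opposed to EST_match features:

    scaffold_1	example	gene	1000	9000	.	+	.	ID=gene1
    scaffold_1	example	mRNA	1000	9000	.	+	.	ID=mRNA1;Parent=gene1
    scaffold_1	example	exon	1000	1500	.	+	.	Parent=mRNA1
    scaffold_1	example	exon	3000	9000	.	+	.	Parent=mRNA1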

I am experiencing another error though, which I think might be related to error #265... After running mikado serialise, the mikado.db database is created and seems all right, but I get the following errors:

serialize.log:

    2020-04-03 10:09:16,969 - serialiser - serialise.py:268 - INFO - setup - MainProcess - Command line: /data/leuven/329/vsc32990/miniconda3/envs/mikado/bin/mikado serialise --json-conf configuration.yaml --xml mikado.blast.xml.gz --orfs mikado.bed --blast_targets swissprotmusmusculus.fasta
    2020-04-03 10:09:28,465 - serialiser - dbutils.py:54 - WARNING - create_connector - MainProcess - No database found, creating a mock one!

And the stderr:

    /data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/configuration/configurator.py:529: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      scoring = yaml.load(scoring_file)
    2020-04-03 10:10:22,891 - main - __init__.py:124 - ERROR - main - MainProcess - Mikado crashed, cause:
    2020-04-03 10:10:22,891 - main - __init__.py:125 - ERROR - main - MainProcess - No transaction is begun.
    Traceback (most recent call last):
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/__init__.py", line 110, in main
        args.func(args)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/subprograms/serialise.py", line 323, in serialise
        load_orfs(args, logger)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/subprograms/serialise.py", line 144, in load_orfs
        serializer()
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/serializers/orf.py", line 314, in __call__
        self.serialize()
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/serializers/orf.py", line 288, in serialize
        self.session.commit()
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 1034, in commit
        raise sa_exc.InvalidRequestError("No transaction is begun.")
    sqlalchemy.exc.InvalidRequestError: No transaction is begun.

Although the db file is created, mikado pick is not working, so I think something is going wrong in the serialise stage. I would welcome any suggestions for solving this...

lucventurini commented 4 years ago

Dear @Ruth-hals

Many apologies for the delayed response. Your problem might be due to the fact that mikado serialise is trying to initialise the database in the wrong place. Would you please be able to rerun mikado serialise adding the flag -od {output folder} and see whether this solves the problem?
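For example, reusing the command line from the log above (a sketch; "serialise_output" is a placeholder folder name):

    mikado serialise --json-conf configuration.yaml --xml mikado.blast.xml.gz \
        --orfs mikado.bed --blast_targets swissprotmusmusculus.fasta \
        -od serialise_output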

I have been making a lot of changes to serialise (see #280) and I am waiting for tests to be completed before uploading the new version to conda. I am fairly confident that the bug will not be present in that version.

Ruth-hals commented 4 years ago

Dear @lucventurini,

Thanks for replying! I specified the output folder and it seems the log file was OK, but I am still getting this stderr:

    2020-04-03 15:44:39,032 - main - __init__.py:124 - ERROR - main - MainProcess - Mikado crashed, cause:
    2020-04-03 15:44:39,033 - main - __init__.py:125 - ERROR - main - MainProcess - No transaction is begun.
    Traceback (most recent call last):
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/__init__.py", line 110, in main
        args.func(args)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/subprograms/serialise.py", line 323, in serialise
        load_orfs(args, logger)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/subprograms/serialise.py", line 144, in load_orfs
        serializer()
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/serializers/orf.py", line 314, in __call__
        self.serialize()
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/serializers/orf.py", line 288, in serialize
        self.session.commit()
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 1034, in commit
        raise sa_exc.InvalidRequestError("No transaction is begun.")
    sqlalchemy.exc.InvalidRequestError: No transaction is begun.

mikado.db looks OK, but after trying mikado pick this is my log file:

    2020-04-03 15:50:04,571 - main_logger - picker.py:320 - INFO - setup_logger - MainProcess - Begun analysis of mikado_prepared.gtf
    2020-04-03 15:50:04,571 - main_logger - picker.py:322 - INFO - setup_logger - MainProcess - Command line: /data/leuven/329/vsc32990/miniconda3/envs/mikado/bin/mikado pick --json-conf configuration.yaml -db mikado.db --subloci_out mikado.subloci.gff3
    2020-04-03 15:50:04,571 - main_logger - picker.py:236 - INFO - setup_shm_db - MainProcess - Copy into a SHM db: False
    2020-04-03 15:50:04,572 - listener - picker.py:339 - WARNING - setup_logger - MainProcess - Current level for queue: WARNING

and the stderr:

    /data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/configuration/configurator.py:529: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      scoring = yaml.load(scoring_file)
    2020-04-03 15:50:04,912 - main - __init__.py:124 - ERROR - main - MainProcess - Mikado crashed, cause:
    2020-04-03 15:50:04,912 - main - __init__.py:125 - ERROR - main - MainProcess - 'DiGraph' object has no attribute 'add_path'
    Traceback (most recent call last):
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/__init__.py", line 110, in main
        args.func(args)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/subprograms/pick.py", line 152, in pick
        creator()
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/picking/picker.py", line 1156, in __call__
        self._parse_and_submit_input(data_dict)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/picking/picker.py", line 1129, in _parse_and_submit_input
        self.__submit_single_threaded(data_dict)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/picking/picker.py", line 1053, in __submit_single_threaded
        source=self.json_conf["pick"]["output_format"]["source"])
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/loci/superlocus.py", line 149, in __init__
        super().add_transcript_to_locus(transcript_instance)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/loci/abstractlocus.py", line 440, in add_transcript_to_locus
        self.add_path_to_graph(transcript, self._internal_graph)
      File "/data/leuven/329/vsc32990/miniconda3/envs/mikado/lib/python3.6/site-packages/Mikado/loci/abstractlocus.py", line 1154, in add_path_to_graph
        graph.add_path(segments)
    AttributeError: 'DiGraph' object has no attribute 'add_path'

lucventurini commented 4 years ago

Dear @Ruth-hals

  1. May I ask which version of Mikado you are using? I gather the one from conda?
  2. The problem with pick is an incompatibility with the networkx library. It will be necessary to downgrade the library to a version < 2 for Mikado to function (see the sketch after this list).
  3. I think from the stderr that the mikado.db will unfortunately be empty. Would you please be able to see whether the following tables are populated:
    $ sqlite3 mikado.db
    > select count(*) from hit;
    > select count(*) from orf;
    > select count(*) from query;
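For point 2, a minimal sketch of the downgrade, assuming a pip-managed environment (conda users would pin networkx through conda instead; this command is an illustration, not from the thread):

    pip install "networkx<2"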

Kind regards

Ruth-hals commented 4 years ago

Dear @lucventurini,

Yes, I'm using the version from conda. So I checked the content of the db: it seems serialise didn't integrate my BLAST or TransDecoder info. Do you have an idea why this could be? For blasting and translating I concatenated the different fastas into one. Might this be the problem?

    sqlite> select count(*) from hit;
    0
    sqlite> select count(*) from orf;
    0
    sqlite> select count(*) from query;
    634113

Kind regards, Ruth

lucventurini commented 4 years ago

Dear @Ruth-hals

This should now be fixed in version 2.0rc2 on both PyPI and BioConda.

If you could please install the latest version and let us know whether it works for you, we would be grateful.
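A sketch of the upgrade via PyPI (assuming a pip-managed environment; the conda command would differ):

    pip install Mikado==2.0rc2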

Kind regards

Ruth-hals commented 4 years ago

Dear @lucventurini, I downloaded the new version and went through the configure and prepare steps without problems, but now it seems stuck on serialise. Even though my job is running, there are no stdout, stderr, or log files created, so I'm not sure where it is going wrong. It seems the job freezes after 24 sec. Any idea what might be the problem? Kind regards, Ruth

lucventurini commented 4 years ago

Dear @Ruth-hals

Put like this, it's a bit hard to diagnose... Would you be able, though, to try to launch without providing the ORFs file, please? That is the first big operation. Knowing whether it gets stuck there, or before, could help me pinpoint the problem.

Kind regards

Ruth-hals commented 4 years ago

Dear @lucventurini, Leaving out the ORFs file lets the job finish within 1 min and does give me the serialize.log:

    2020-04-17 10:40:44,818 - serialiser - serialise.py:290 - INFO - setup - MainProcess - Command line: /data/leuven/329/vsc32990/miniconda3/envs/Mikado/bin/mikado serialise --json-conf configuration.yaml --xml mikado.blast.xml.gz --blast_targets swissprotmusmusculus.fasta -od /staging/leuven/stg_00056/octopus/mikado/mikado2/output/
    2020-04-17 10:40:44,818 - serialiser - serialise.py:296 - INFO - setup - MainProcess - Random seed: 0
    2020-04-17 10:40:44,818 - serialiser - serialise.py:334 - INFO - setup - MainProcess - Using a sqlite database (location: /staging/leuven/stg_00056/octopus/mikado/mikado2/output/mikado.db)
    2020-04-17 10:40:44,818 - serialiser - serialise.py:338 - INFO - setup - MainProcess - Requested 4 threads, forcing single thread: False
    2020-04-17 10:40:44,818 - serialiser - serialise.py:154 - INFO - load_orfs - MainProcess - No ORF data provided, skipping
    2020-04-17 10:40:44,818 - serialiser - serialise.py:107 - INFO - load_blast - MainProcess - Starting to load BLAST data
    2020-04-17 10:40:44,820 - serialiser - blast_serialiser.py:78 - INFO - __init__ - MainProcess - Number of dedicated workers: 4
    2020-04-17 10:40:45,738 - serialiser - blast_serialiser.py:244 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
    2020-04-17 10:40:45,827 - serialiser - blast_serialiser.py:279 - INFO - __serialize_targets - MainProcess - Loaded 17033 objects into the "target" table
    2020-04-17 10:40:45,828 - serialiser - blast_serialiser.py:173 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
    2020-04-17 10:41:20,898 - serialiser - blast_serialiser.py:221 - INFO - __serialize_queries - MainProcess - Loaded 610644 objects into the "query" table
    2020-04-17 10:42:51,220 - serialiser - xml_serialiser.py:214 - INFO - _serialise_xmls - MainProcess - Loaded 0 alignments for 0 queries
    2020-04-17 10:42:51,221 - serialiser - xml_serialiser.py:216 - INFO - _serialise_xmls - MainProcess - Finished loading blast hits
    2020-04-17 10:42:51,297 - serialiser - serialise.py:125 - INFO - load_blast - MainProcess - Finished to load BLAST data
    2020-04-17 10:42:51,297 - serialiser - serialise.py:71 - INFO - load_junctions - MainProcess - Skipping junction loading as no junctions have been provided.
    2020-04-17 10:42:51,297 - serialiser - serialise.py:383 - INFO - serialise - MainProcess - Finished

a mikado.db file is also made:

    sqlite> select count(*) from query;
    610644
    sqlite> select count(*) from orf;
    0
    sqlite> select count(*) from target;
    17033
    sqlite> select count(*) from hit;
    0

The stderr file is still empty, and a tmpr1dugpm3.db file is also created (its .tables output shows a single table, dump).

So it seems that the BLAST results are not integrated either, since hit is empty?

lucventurini commented 4 years ago

Dear @Ruth-hals

OK, at least we know now that one of the problems is with the ORFs. Would you please be able to attach/paste here:

  1. the exact command line you used
  2. the size of the input files
  3. the configuration file

I will try to have a look as quickly as possible.

Ruth-hals commented 4 years ago

Dear @lucventurini,

I used a different scoring file in this new version and the number of transcripts in mikado_prepared.fasta differs slightly from before; I'm re-blasting and translating the sequences and will try it again.

In the meantime, here are the details. The command line used:

    mikado serialise --json-conf configuration.yaml --xml mikado.blast.xml.gz --orfs mikado.bed --blast_targets swissprotmusmusculus.fasta -od /staging/leuven/stg_00056/octopus/mikado/mikado2/out/

The size of the input files:

    22611540  mikado.bed (ORFs)
    46399488  mikado.blast.xml.gz

The config file is attached. I wasn't able to use a txt file to specify my gffs, so I used the configure command to adjust the configuration file. The only thing I haven't been able to include is specific scoring for each gff. I will keep you updated, thanks for helping! Kind regards, Ruth

configuration.txt

lucventurini commented 4 years ago

Dear @Ruth-hals

As you are re-blasting now: please stop that BLAST and instead re-blast using the following options:

    -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop"

This will produce a tabular output which Mikado can parse much more quickly; it's a new feature that we introduced for the RC.
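Putting it together, a sketch of a full invocation with this format (file names follow those used earlier in the thread; saving the output under a .tsv name is an assumption on my part, to keep the file type obvious downstream):

    blastx -query mikado_prepared.fasta -db swissprotmusmusculus.fasta \
        -max_target_seqs 5 -num_threads 10 -evalue 1e-6 \
        -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop" \
        | gzip -c > mikado.blast.tsv.gz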

lucventurini commented 4 years ago

The config file is attached. I wasn't able to use a txt file to specify my gffs, so I used the configure command to adjust the configuration file. The only thing I haven't been able to include is specific scoring for each gff.

This is unfortunate. May I ask what was the error that prevented you from using a text file to configure mikado?

Also, looking at the configuration file: it's a long shot, but could you add to the command line the parameter:

-mo 100000

This will tell Mikado to keep less data in memory before flushing to the database. If the issue is there (i.e. it is keeping a lot of data in memory), this should help solve it.
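Appended to the serialise command used earlier, that would look like (a sketch):

    mikado serialise --json-conf configuration.yaml --xml mikado.blast.xml.gz \
        --orfs mikado.bed --blast_targets swissprotmusmusculus.fasta \
        -od /staging/leuven/stg_00056/octopus/mikado/mikado2/out/ -mo 100000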

Ruth-hals commented 4 years ago

Dear @lucventurini,

Thanks for your suggestions. I tried again starting from configure and I was able to use the text file. Now, using the new BLAST and ORFs, I was able to generate a database.

    sqlite> .tables
    chrom             hit        orf
    external          hsp        query
    external_sources  junctions  target
    sqlite> select count(*) from hit;
    0
    sqlite> select count(*) from orf;
    93056
    sqlite> select count(*) from query;
    610646
    sqlite> select count(*) from target;
    17033

So the ORFs now seem fine; I think the problem was the different fasta files. I don't know why the BLAST results are not being integrated, though. As you suggested, I blasted with the following command:

    blastx -max_target_seqs 5 -num_threads 10 -query mikado_prepared.fasta \
        -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop" \
        -db swissprotmusmusculus.fasta -evalue 0.000001 2> blast.log | sed '/^$/d' | gzip -c - > mikado.blast.xml.gz

The blast output file seems fine. The stderr and the blast log are empty.

lucventurini commented 4 years ago

@Ruth-hals

Could you change the name of mikado.blast.xml.gz to mikado.blast.tsv.gz, delete the database, and relaunch? I fear that currently mikado will naively expect a file called .xml to be, well, an XML file, which will break the parsing.
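As a sketch, the full sequence might be (the --tsv flag is the one that appears in the log below; "output_folder" is a placeholder):

    mv mikado.blast.xml.gz mikado.blast.tsv.gz
    rm mikado.db
    mikado serialise --json-conf configuration.yaml --tsv mikado.blast.tsv.gz \
        --orfs mikado.bed --blast_targets swissprotmusmusculus.fasta -od output_folder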

Kind regards

Ruth-hals commented 4 years ago

I changed it to tsv... Log file:

    2020-04-17 17:28:08,304 - serialiser - serialise.py:290 - INFO - setup - MainProcess - Command line: /data/leuven/329/vsc32990/miniconda3/envs/Mikado/bin/mikado serialise --json-conf configuration.yaml --tsv mikado.blast.tsv.gz --orfs mikado.bed --blast_targets swissprotmusmusculus.fasta -od /staging/leuven/stg_00056/octopus/mikado/mikado3/outputserialize/
    2020-04-17 17:28:08,304 - serialiser - serialise.py:296 - INFO - setup - MainProcess - Random seed: 0
    2020-04-17 17:28:08,304 - serialiser - serialise.py:334 - INFO - setup - MainProcess - Using a sqlite database (location: /staging/leuven/stg_00056/octopus/mikado/mikado3/outputserialize/mikado.db)
    2020-04-17 17:28:08,304 - serialiser - serialise.py:338 - INFO - setup - MainProcess - Requested 4 threads, forcing single thread: False
    2020-04-17 17:28:08,304 - serialiser - serialise.py:140 - INFO - load_orfs - MainProcess - Starting to load ORF data
    2020-04-17 17:28:45,434 - Bed12ParseWrapper-2 - bed12.py:1704 - INFO - run - Bed12ParseWrapper-2 - Started 0
    2020-04-17 17:28:45,438 - Bed12ParseWrapper-3 - bed12.py:1704 - INFO - run - Bed12ParseWrapper-3 - Started 1
    2020-04-17 17:28:45,440 - Bed12ParseWrapper-4 - bed12.py:1704 - INFO - run - Bed12ParseWrapper-4 - Started 2
    2020-04-17 17:28:45,443 - Bed12ParseWrapper-5 - bed12.py:1704 - INFO - run - Bed12ParseWrapper-5 - Started 3
    2020-04-17 17:28:58,475 - serialiser - orf.py:477 - INFO - serialize_multiple_threads - MainProcess - Finished loading 93056 ORFs into the database
    2020-04-17 17:28:59,041 - serialiser - serialise.py:152 - INFO - load_orfs - MainProcess - Finished loading ORF data
    2020-04-17 17:28:59,153 - serialiser - serialise.py:107 - INFO - load_blast - MainProcess - Starting to load BLAST data
    2020-04-17 17:28:59,155 - serialiser - blast_serialiser.py:78 - INFO - __init__ - MainProcess - Number of dedicated workers: 4
    2020-04-17 17:29:08,456 - serialiser - blast_serialiser.py:244 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
    2020-04-17 17:29:08,519 - serialiser - blast_serialiser.py:279 - INFO - __serialize_targets - MainProcess - Loaded 17033 objects into the "target" table
    2020-04-17 17:29:08,519 - serialiser - blast_serialiser.py:173 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
    2020-04-17 17:29:08,709 - serialiser - blast_serialiser.py:221 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table
    2020-04-17 17:29:08,751 - serialiser - tab_serialiser.py:30 - INFO - _serialise_tabular - MainProcess - Creating a pool with 4 workers for analysing BLAST results
    2020-04-17 17:29:09,958 - serialiser - tabular_utils.py:363 - INFO - parse_tab_blast - MainProcess - Reading mikado.blast.tsv.gz data

stderr:

    2020-04-17 17:29:17,312 - main - __init__.py:120 - ERROR - main - MainProcess - Mikado crashed, cause:
    2020-04-17 17:29:17,313 - main - __init__.py:121 - ERROR - main - MainProcess - ('slength', 766518, 766518)
    Traceback (most recent call last):
      File "/data/leuven/329/vsc32990/miniconda3/envs/Mikado/lib/python3.7/site-packages/Mikado/__init__.py", line 106, in main
        args.func(args)
      File "/data/leuven/329/vsc32990/miniconda3/envs/Mikado/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 380, in serialise
        load_blast(args, logger)
      File "/data/leuven/329/vsc32990/miniconda3/envs/Mikado/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 124, in load_blast
        part_launcher(filenames)
      File "/data/leuven/329/vsc32990/miniconda3/envs/Mikado/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 55, in xml_launcher
        xml_serializer()
      File "/data/leuven/329/vsc32990/miniconda3/envs/Mikado/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 362, in __call__
        self.serialize()
      File "/data/leuven/329/vsc32990/miniconda3/envs/Mikado/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 346, in serialize
        self.__serialise_tabular()
      File "/data/leuven/329/vsc32990/miniconda3/envs/Mikado/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 350, in __serialise_tabular
        _serialise_tabular(self)
      File "/data/leuven/329/vsc32990/miniconda3/envs/Mikado/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/tab_serialiser.py", line 41, in _serialise_tabular
        parser(bname=fname)
      File "/data/leuven/329/vsc32990/miniconda3/envs/Mikado/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/tabular_utils.py", line 365, in parse_tab_blast
        data = sanitize_blast_data(data, queries, targets, qmult=qmult, tmult=tmult)
      File "/data/leuven/329/vsc32990/miniconda3/envs/Mikado/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/tabular_utils.py", line 233, in sanitize_blast_data
        assert ~(data[col].isna().any()), (col, data[data[col].isna()].shape[0], data.shape[0])
    AssertionError: ('slength', 766518, 766518)

lucventurini commented 4 years ago

Hi @Ruth-hals. So before, as I imagined, Mikado rejected the BLAST file because it was not an XML.

With the tabular data the problem is different... I think there is a join problem, i.e. the data relating to the BLAST targets is not linked correctly. I am very sorry to ask for another check, but would you please be able to paste here some lines of the "sseqid" column in the BLAST file, as well as some of the headers from the BLAST targets?

My suspicion is that the targets have two IDs per header, and BLAST has used one while Mikado has used the other. E.g.

    >Q6V4H0 RecName: Full=8-hydroxygeraniol dehydrogenase; Short=Cr10HGO; EC=1.1.1.324;

In the scenario I fear, Mikado would use Q6V4H0 while BLAST would report Cr10HGO.

It's strange because I routinely test Mikado using swissprot data, but it seems the clearest explanation.
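A quick way to compare the two with standard Unix tools (a sketch; adjust the file names to your own):

    # IDs as BLAST reported them in the tabular output (field 2 = sseqid)
    zcat mikado.blast.tsv.gz | cut -f 2 | sort -u | head
    # IDs as they appear in the target fasta headers
    grep '^>' swissprotmusmusculus.fasta | head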

Ruth-hals commented 4 years ago

Target data:

    sp|Q8CJ12|AGRG2_MOUSE Adhesion G-protein coupled receptor G2 OS=Mus musculus OX=10090 GN=Adgrg2 PE=1 SV=1
    sp|Q01815|CAC1C_MOUSE Voltage-dependent L-type calcium channel subunit alpha-1C OS=Mus musculus OX=10090 GN=Cacna1c PE=1 SV=1
    sp|Q99KG5|LSR_MOUSE Lipolysis-stimulated lipoprotein receptor OS=Mus musculus OX=10090 GN=Lsr PE=1 SV=1
    sp|P00416|COX3_MOUSE Cytochrome c oxidase subunit 3 OS=Mus musculus OX=10090 GN=mt-Co3 PE=1 SV=2

BLAST sseqid results:

    sp|P00416|COX3_MOUSE
    sp|P00416|COX3_MOUSE
    sp|P00397|COX1_MOUSE
    sp|P03899|NU3M_MOUSE

Based on this, I don't think that is the problem... I am repeating the BLAST without the tabular format specification, which might be an easy fix?

Thanks for helping!!!

Ruth-hals commented 4 years ago

Dear @lucventurini, Re-blasting worked fine. I have now subloci and loci gff files. Thank you for all your help.

lucventurini commented 4 years ago

Dear @Ruth-hals

That's great to hear! May I ask what was the difference that made the BLAST function this time?

Ruth-hals commented 4 years ago

I didn't include the tabular format parameters (and ran the BLAST as suggested in the documentation, with the XML output). I think my errors from before originated from the fact that I used a different fasta file for blasting and translating than the output in mikado_prepared.fasta. Thanks again! May I ask what the difference is between the subloci and loci gff files? What do you think is the best way to annotate the newly predicted gene models with UniProt IDs? (I'm working with a non-model organism.)

lucventurini commented 4 years ago

@Ruth-hals

The subloci file is there for diagnostic purposes only; it contains the clustering done by Mikado during the early stages of picking. The loci file is the one you need, as it contains the final selection!

As for doing the functional annotation ... not my field, but in the past we have used AHRD (https://github.com/groupschoof/AHRD) for this purpose.

Ruth-hals commented 4 years ago

Thanks again!!

Ruth-hals commented 4 years ago

@lucventurini The output of mikado yielded fewer genes than expected, and after looking at it in IGV I was wondering what can cause this. I used mikado to merge two gtf files (the two top tracks in the figure); the mikado output is in the bottom track, and you can see that two genes are completely missing. How is this possible?

[image: Genemissing]

14zac2 commented 3 years ago

Hi there! I just wanted to mention that I had a very similar issue to the one above with the BLAST tabular format. My error was as follows:

    ...
        _serialise_tabular(self)
      File "/path/mikado_3.6/mikado_env/lib/python3.6/site-packages/Mikado/serializers/blast_serializer/tab_serialiser.py", line 41, in _serialise_tabular
        parser(bname=fname)
      File "/path/mikado_3.6/mikado_env/lib/python3.6/site-packages/Mikado/serializers/blast_serializer/tabular_utils.py", line 404, in parse_tab_blast
        data = sanitize_blast_data(data, queries, targets, qmult=qmult, tmult=tmult)
      File "/path/mikado_3.6/mikado_env/lib/python3.6/site-packages/Mikado/serializers/blast_serializer/tabular_utils.py", line 272, in sanitize_blast_data
        assert ~(data[col].isna().any()), err_val
    AssertionError: ('slength', sseqid
    0    P11260
    1    P31790
    2    P04025
    3    P11260
    4    O00370)

What solved it for me was changing the suffix of my BLAST output file from .out to .tsv and adding --blast_targets uniprot_sprot.fasta to the command line of mikado serialise. I'm not sure which of those two changes did the trick, but it fixed the assertion error!

lucventurini commented 3 years ago

@lucventurini The output of mikado yielded fewer genes than expected, and after looking at it in IGV I was wondering what can cause this. I used mikado to merge two gtf files (the two top tracks in the figure); the mikado output is in the bottom track, and you can see that two genes are completely missing. How is this possible?

[image: Genemissing]

Hi @Ruth-hals ,

@14zac2 is right about Mikado relying on the BLAST output, so that is a good place to start with (incidentally, thank you @14zac2 for the very fast reply!).

That said, I would need a little bit more information to diagnose what is happening with your models.

Specifically, I would need to understand:

A possibility to diagnose the issue more quickly could be to launch mikado pick with the following options:

-od debug_folder -r NC043010.1:58000-60000 -lv DEBUG

which will launch a mini-mikado only on the region above, in debug mode. The log would then help me pinpoint the origin of the issue.
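Combined with the pick invocation used earlier in this thread, that might look like (a sketch):

    mikado pick --json-conf configuration.yaml -db mikado.db \
        -od debug_folder -r NC043010.1:58000-60000 -lv DEBUG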

lucventurini commented 3 years ago

Closing for now. Please @Ruth-hals let us know of progress on your front.