EI-CoreBioinformatics / mikado

Mikado is a lightweight Python3 pipeline whose purpose is to facilitate the identification of expressed loci from RNA-Seq data and to select the best models in each locus.
https://mikado.readthedocs.io/en/stable/
GNU Lesser General Public License v3.0

Bugs in daijin #209

Closed asdcid closed 5 years ago

asdcid commented 5 years ago

Hi,

I tried to run daijin assemble:

daijin assemble -nd -C 20 -nd daijin.yaml

And then I got this error:

KeyError in line 40 of /home/raymond/devel/python/thirdparty/anaconda2/envs/mikado1.5/lib/python3.6/site-packages/Mikado-2.0rc4-py3.6-linux-x86_64.egg/Mikado/daijin/tr.snakefile:
'pick'
  File "/home/raymond/devel/python/thirdparty/anaconda2/envs/mikado1.5/lib/python3.6/site-packages/Mikado-2.0rc4-py3.6-linux-x86_64.egg/Mikado/daijin/tr.snakefile", line 40, in <module>

The configuration file created by daijin configure is:

#  This is a standard configuration file for Daijin. Fields:
#  - short_reads: this section deals with RNA-Seq short read input data.
#  - name: name of the species under analysis.
#  - reference: reference data to use. A reference genome is required.
align_methods:
  hisat:
  - ''
asm_methods:
  class:
  - ''
  cufflinks:
  - ''
  scallop:
  - ''
  stringtie:
  - ''
  trinity:
  - ''
  trinitydn: false
blastx:
  chunks: 1
  evalue: 1.0e-07
  max_target_seqs: 10
  prot_db:
  - Reference/Aedes_aegypti.fasta
extra:
  #  Options related to indexing.
  star_index: ''
long_read_align_methods: {}
long_reads:
  #  Parameters related to long reads to use for the assemblies.
  files: []
  samples: []
  skip_split: true
  strandedness: []
mikado:
  db_settings:
    #  Settings related to DB connection. Parameters:
    #  db: the DB to connect to. Required. Default: mikado.db
    #  dbtype: Type of DB to use. Choices: sqlite, postgresql, mysql. Default: sqlite.
    #  dbhost: Host of the database. Unused if dbtype is sqlite. Default: localhost
    #  dbuser: DB user. Default:
    #  dbpasswd: DB password for the user. Default:
    #  dbport: Integer. It indicates the default port for the DB.
    db: mikado.db
    dbhost: localhost
    dbpasswd: ''
    dbport: 0
    dbtype: sqlite
    dbuser: ''
  modes:
  - permissive
  use_diamond: true
  use_prodigal: true
name: Dmelanogaster
out_dir: Dmelanogaster
portcullis:
  #  Options related to portcullis
  canonical_juncs: C,S
  do: true
reference:
  genome: Reference/Drosophila_melanogaster.BDGP6.dna.toplevel.fa
  genome_fai: ''
  transcriptome: ''
scheduler: ''
short_reads:
  #  Parameters related to the reads to use for the assemblies. Voices:
  #  - r1: array of left read files.
  #  - r2: array of right read files. It must be of the same length of r1; if one
  #    one or more of the samples are single-end reads, add an empty string.
  #  - samples: array of the sample names. It must be of the same length of r1.
  #  - strandedness: array of strand-specificity of the samples. It must be of the
  #    same length of r1. Valid values: fr-firststrand, fr-secondstrand, fr-unstranded.
  r1:
  - Reference/Reads/ERR1662533_1.fastq.gz
  r2:
  - Reference/Reads/ERR1662533_2.fastq.gz
  samples:
  - ERR1662533
  strandedness:
  - fr-unstranded
tgg:
  #  Options related to genome-guided Trinity.
  coverage: 0.7
  identity: 0.95
  max_mem: 6000
  npaths: 0
threads: 2
transdecoder:
  execute: true
  min_protein_len: 30

The version I used is Mikado v2.0rc4. Compared with v1.2.4 from conda, the configuration generated by v2.0rc4 seems to be missing intron_len, scoring_file and other fields.
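
To illustrate how a missing configuration section can surface as KeyError: 'pick', here is a minimal sketch. It does not use Mikado's actual code or schema: both dictionaries are hypothetical layouts, and placing intron_len and scoring_file under a "pick" section is an assumption made only for the example.

```python
from typing import Any, Dict, Iterator, Set


def key_paths(config: Dict[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield a dotted path for every key in a nested dictionary."""
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            yield from key_paths(value, path)


def missing_paths(expected: Dict[str, Any], actual: Dict[str, Any]) -> Set[str]:
    """Return the key paths present in `expected` but absent from `actual`."""
    return set(key_paths(expected)) - set(key_paths(actual))


# Hypothetical older layout (assumption: these fields sit under "pick").
old_style = {
    "mikado": {
        "modes": ["permissive"],
        "pick": {"intron_len": None, "scoring_file": None},
    }
}

# Layout mirroring the "mikado" keys of the configuration posted above.
new_style = {
    "mikado": {"modes": ["permissive"], "use_diamond": True, "use_prodigal": True}
}

missing = sorted(missing_paths(old_style, new_style))

# Code that still expects the old layout fails on the first missing key.
try:
    new_style["mikado"]["pick"]
    lookup_error = None
except KeyError as exc:
    lookup_error = str(exc)
```

Comparing the key paths of a known-good configuration with the generated one is a quick way to list every field that disappeared before running the pipeline.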

Also, there seem to be some bugs in Mikado/daijin/tr.snakefile. Line 389 reads @functools.lru_cahe(maxsize=4, typed=True), which is missing a "c" in "cache". Line 709 reads output: touch(os.path.join(ALIGN_DIR, "gmap", "index", NAME, "index.done") #os.path.join(ALIGN_DIR, "gmap", "index", NAME, NAME+".sachildguide1024"), which is missing a second ")" after "index.done" to close the touch( call.
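
For reference, a small sketch of what the corrected lines would look like in plain Python. The decorated function sample_label and the value of ALIGN_DIR are hypothetical stand-ins, since the Snakefile's own function and directory are not shown in this thread; touch() is a Snakemake helper and appears here only in a comment.

```python
import functools
import os

# functools has no attribute named "lru_cahe", so evaluating the misspelled
# decorator would raise AttributeError when the Snakefile is loaded.
has_typo_name = hasattr(functools, "lru_cahe")


@functools.lru_cache(maxsize=4, typed=True)
def sample_label(sample: str) -> str:
    """Hypothetical stand-in for the function cached in the Snakefile."""
    return sample.upper()


sample_label("ERR1662533")
sample_label("ERR1662533")  # the second call is served from the cache
cache_hits = sample_label.cache_info().hits

ALIGN_DIR = "align"      # placeholder value, not taken from the Snakefile
NAME = "Dmelanogaster"   # "name" from the configuration above

# Balanced version of the path expression on line 709. In the Snakefile it
# would be wrapped once more: output: touch(os.path.join(...))
index_done = os.path.join(ALIGN_DIR, "gmap", "index", NAME, "index.done")
```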

The rule asm_map_trinitygg also seems to have a problem with the variables SAMPLE_MAP[wildcards.sample] and params.strandedness.

Cheers, Raymond

lucventurini commented 5 years ago

Dear @asdcid, thank you for reporting this. We are planning to retire daijin assemble soon, but I will try to fix the bugs you found as quickly as possible.

The problem most likely stems from my recent reorganisation of the configuration file, which had become sprawling, with a lot of duplicated values. Hopefully it should not take too long to fix.

lucventurini commented 5 years ago

Dear @asdcid, I should have solved the small issues you reported. I will keep this report open until I have put proper tests for daijin assemble in place.

Many thanks for reporting this; we would have released with a buggy Snakefile otherwise.

lucventurini commented 5 years ago

Dear @asdcid, I have now implemented a proper test for daijin assemble. While doing so, I fixed a very large number of bugs in the pipeline today.

Once the travis check completes successfully, I will merge back into the master branch and close the issue.

Thank you again for reporting and prodding me to clean up the code in this section.

asdcid commented 5 years ago

Thanks for your help. However, I have another question. I tried to run mikado (with permissive mode) with trinity and scallop assembly results, but it seems that the BUSCO complete score in the final mikado result pick/mikado-permissive.loci.gff3 is pretty low (~30%, vs ~90% for original trinity or scallop assemblies). Do you have any idea about that?

Thank you.

asdcid commented 5 years ago

I think I found the answer. I am using daijin mikado, and it seems that neither blastx nor diamond was run.

Also, it seems that use_diamond is always true in the configuration file, even when I set --use-blast in mikado configure.

The dag file for daijin mikado is attached: dag.pdf

lucventurini commented 5 years ago

Dear @asdcid, thank you again for your report. May I ask whether you specified one or more protein FASTA files during configuration? If they are missing, that would explain why mikado did not perform a blast run.
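
A rough sketch of the check described above, using the key names from the configuration posted earlier in this thread (blastx.prot_db and mikado.use_diamond). The function homology_search_plan is hypothetical and is not daijin's implementation; that daijin decides in this way is an assumption.

```python
from typing import Any, Dict


def homology_search_plan(config: Dict[str, Any]) -> str:
    """Report which homology search a configuration would lead to."""
    # Drop empty entries such as '' so that a placeholder list counts as unset.
    prot_db = [p for p in (config.get("blastx", {}).get("prot_db") or []) if p]
    if not prot_db:
        return "skipped: no protein FASTA configured"
    use_diamond = config.get("mikado", {}).get("use_diamond", False)
    return "diamond" if use_diamond else "blastx"


# Values copied from the configuration posted at the top of the thread.
posted = {
    "blastx": {"prot_db": ["Reference/Aedes_aegypti.fasta"]},
    "mikado": {"use_diamond": True},
}
no_proteins = {"blastx": {"prot_db": []}, "mikado": {"use_diamond": True}}
blast_only = {
    "blastx": {"prot_db": ["Reference/Aedes_aegypti.fasta"]},
    "mikado": {"use_diamond": False},
}

plan_posted = homology_search_plan(posted)
plan_no_proteins = homology_search_plan(no_proteins)
plan_blast_only = homology_search_plan(blast_only)
```

Running a check like this on the file actually passed to daijin mikado would show whether the protein database list is empty in that file.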

I will check and correct the bug regarding --use-blast as soon as possible.

lucventurini commented 5 years ago

Dear @asdcid, unfortunately I cannot reproduce the bug regarding --use-blast with the latest version of the code. I just tried it, and daijin correctly used BLAST+ instead of DIAMOND.

I am now testing it in Travis (see https://travis-ci.org/lucventurini/mikado/jobs/585974482), where I can confirm that the bug does not present itself.

Regarding your more concerning point:

However, I have another question. I tried to run mikado (with permissive mode) with trinity and scallop assembly results, but it seems that the BUSCO complete score in the final mikado result pick/mikado-permissive.loci.gff3 is pretty low (~30%, vs ~90% for original trinity or scallop assemblies). Do you have any idea about that?

This is indeed not great. Please let me know if adding BLAST datasets solves the issue. If it does not, I will create another ticket to investigate the matter.

lucventurini commented 5 years ago

Closing, as daijin now performs as expected. @asdcid, please let me know about BUSCO. If it is still not behaving properly, we will open another ticket.