blobtoolkit / blobtoolkit

Interactive quality assessment of genome assemblies
http://blobtoolkit.genomehubs.org
MIT License
80 stars 9 forks source link

Error with blobtools_create step #213

Open aureliendejode opened 2 months ago

aureliendejode commented 2 months ago

Hello,

I have encountered the followinf error: more SHADDVT3_asm_bp_hap1_p_ctg/run_blobtools_create.log

Reading all TSV files in ../window_stats Traceback (most recent call last): File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/bin/blobtools", line 8, in sys.exit(cli()) File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/blobtools.py", line 105, in cli sys.exit(subcommand()) File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/add.py", line 203, in cli main(args) File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/add.py", line 149, in main parsed = field["module"].parse( File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/busco.py", line 82, in parse busco = parse_busco(file, identifiers=kwargs["dependencies"]["identifiers"]) File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/busco.py", line 63, in parse_busco raise UserWarning( UserWarning: Contig names in the Busco file did not match dataset identifiers.

I saw people having similar issue (e.g. #2) in the past but I did not see the fix for it. In my case it looks like everything works fine for busco metazoa and eukaryota but for archaea and bacteria busco removed "_path" from the contigs id and I think that is what is causing the error.

zcat SHADDVT3_asm_bp_hap1_p_ctg.busco.bacteria_odb10/full_table.tsv.gz | head

BUSCO version is: 5.6.1 The lineage dataset is: bacteria_odb10 (Creation date: 2020-03-06, number of genomes: 4085, number of BUSCOs: 124) Busco id Status Sequence Gene Start Gene End Strand Score Length OrthoDB url Description 4421at2 Missing 9601at2 Missing 26038at2 Fragmented h1tg000560l 68481 70196 + 151.4 265 https://www.orthodb.org/v10?query=26038at2 phosphoribosylformylglycinamidine synthase 91428at2 Missing 95696at2 Missing 143460at2 Missing 182107at2 Missing

Is there a fix to this or is it something that has to be done manually ?

Thanks for your help

rjchallis commented 2 months ago

Hi, It does look like a similar issue. That one should have been fixed in this commit that parsed the sequence IDs differently. Could you share the fasta header for the config that is in the BUSCO results so I can take a look at what needs to be done to get them to match this time

aureliendejode commented 2 months ago

Hi, Here are the headers for the fasta file (the assembly), they all have the following format:

h1tg000309l_path h1tg000310l_path h1tg000311l_path h1tg000312l_path h1tg000313l_path h1tg000314l_path

They are contigs output from hifiasm assembly.

Then in the busco metazoa and eukaryota the contigs id are the same as in the fasta file:

270107at2759 Complete h1tg000039l_path 1923850 1928544 + 617.0 599 271586at2759 Complete h1tg000019l_path 3448379 3458776 + 297.6 389 290630at2759 Complete h1tg000004l_path 2354527 2344837 - 543.9 461 293315at2759 Complete h1tg000021l_path 1983882 1947754 - 509.3 559 296129at2759 Complete h1tg000005l_path 2926981 2938826 + 954.7 708 299766at2759 Complete h1tg000020l_path 9862506 9853025 - 495.8 436 306227at2759 Complete h1tg000025l_path 9236278 9247485 + 542.8 459

For the archaea and bacteria the "_path" has been removed: 26038at2 Fragmented h1tg000560l 68481 70196 + 151.4 265 https://www.orthodb.org/v10?query=26038at2 phosphoribosylformylglycinamidine synthase 91428at2 Missing 95696at2 Missing 143460at2 Missing 182107at2 Missing 219876at2 Missing 223233at2 Fragmented h1tg000118l 1228298 1230874 + 128.3 383 https://www.orthodb.org/v10?query=223233at2 DNA topoisomerase I

Hope that answer your question, let me know if you need more info.

Cheers Aurélien

aureliendejode commented 2 months ago

Hi, Have you found a fix for this ? Best Aurélien

rjchallis commented 2 months ago

Looking at this more closely, it not straightforward to fix this in the BlobToolKit import code - the previous issue was with extra characters being added to the sequence IDs, but here they are removed so it is not as easy to put them back.

I think the best way would be a manual workaround either changing the sequence IDs in the assembly or editing the BUSCO file before import. To get the pipeline to run happily, you could also skip the bacteria and archaea lineages from the set of BUSCOs to run as these are usually included for use when exploring the results rather than required for the pipeline to run.

aureliendejode commented 2 months ago

Thanks for this. I'll just skip the bacteria and archaea lineages for now. Best Aurélien

aureliendejode commented 2 months ago

Hi, I removed the bacteria and archaea lineages but then I ran into an error at the diamond_blastp step. I think it is because I used "taxrule: blastp=buscogenes" for the tax rule. How can I solve this issue. Should change the taxrule for the diamond_blastp step ? Here is my config file:

assembly:
  accession: SHADDVT3
  file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/SHADDVT3_asm_bp_hap1_p_ctg.fasta.gz
  level: contig
  prefix: SHADDVT3_asm_bp_hap1_p_ctg
  span: 459760080
busco:
  download_dir: /share/apps/genetic-databases/busco 
  basal_lineages:
    - eukaryota_odb10
    - bacteria_odb10
    - archaea_odb10
  lineages:
   - metazoa_odb10
   - eukaryota_odb10
reads:
  single:
   - file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/XBTUA20231220R84050PL40740011D01bc2031bc2031_hifireads.fastq.gz
     prefix: XBTUA20231220R84050PL40740011D01bc2031bc2031
     platform: PACBIO_SMRT
settings:
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_min_length: 1000
  blast_overlap: 0
  stats_chunk: 1000
  stats_windows:
  - 0.1
  - 0.01
  - 100000
  - 1000000
  taxdump: /share/apps/genetic-databases/taxdump
  tmp: /tmp
similarity:
  blastn:
    name: nt
    path: /share/apps/genetic-databases/nt
  defaults:
    evalue: 1.0e-10
    import_evalue: 1.0e-25
    max_target_seqs: 10
    taxrule: bestsumorder
  diamond_blastp:
    import_max_target_seqs: 100000
    name: reference_proteomes
    path: /share/apps/genetic-databases/uniprot
    taxrule: blastp=buscogenes
  diamond_blast:
    name: reference_proteomes
    path: /share/apps/genetic-databases/uniprot
taxon:
  name: Stichodactyla haddoni

version: 2
rjchallis commented 2 months ago

I think you may just need to remove bacteria and archaea from the basal_lineages section of the config as well

aureliendejode commented 2 months ago

Yes, it worked but I ran into another issue at the run_blobtools_create:

Reading all TSV files in ../window_stats Loading parsed taxdump Traceback (most recent call last): File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/bin/blobtools", line 8, in sys.exit(cli()) File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/blobtools.py", line 105, in cli sys.exit(subcommand()) File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/add.py", line 203, in cli main(args) File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/add.py", line 149, in main parsed = field["module"].parse( File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/hits.py", line 541, in parse blast = parse_blast( File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/hits.py", line 54, in parse_blast seq_id, start, end = re.split(r"[:-]", parts[0]) ValueError: too many values to unpack (expected 3)

I have redownloaeded the newtaxdump to sure that this was not the issue. Here is what I have in the taxdump folder:

-rw-r--r-- 1 adejode bmtitus 4,9K 14 mai 08:31 gencode.dmp -rw-r--r-- 1 adejode bmtitus 452 14 mai 08:31 division.dmp -rw-r--r-- 1 adejode bmtitus 3,0K 14 mai 08:37 typeoftype.dmp -rw-r--r-- 1 adejode bmtitus 292M 14 mai 08:37 taxidlineage.dmp -rw-r--r-- 1 adejode bmtitus 321M 14 mai 08:37 rankedlineage.dmp -rw-r--r-- 1 adejode bmtitus 1,4M 14 mai 08:37 merged.dmp -rw-r--r-- 1 adejode bmtitus 3,8M 14 mai 08:37 images.dmp -rw-r--r-- 1 adejode bmtitus 691M 14 mai 08:37 fullnamelineage.dmp -rw-r--r-- 1 adejode bmtitus 4,5M 14 mai 08:37 delnodes.dmp -rw-r--r-- 1 adejode bmtitus 232M 14 mai 08:39 nodes.dmp -rw-r--r-- 1 adejode bmtitus 217M 14 mai 08:40 names.dmp -rw-r--r-- 1 adejode bmtitus 19M 14 mai 08:40 citations.dmp -rw-r--r-- 1 adejode bmtitus 26M 14 mai 08:40 typematerial.dmp -rw-r--r-- 1 adejode bmtitus 5,6M 14 mai 08:40 host.dmp -rw-r--r-- 1 adejode bmtitus 44K 14 mai 08:40 excludedfromtype.dmp -rw-r--r-- 1 adejode bmtitus 130M 14 mai 09:20 new_taxdump.tar.gz -rw-r--r-- 1 adejode bmtitus 53 14 mai 09:20 new_taxdump.tar.gz.md5 -rw-r--r-- 1 adejode bmtitus 614M 14 mai 09:25 taxdump.json

It looks like some parsing issue for one of the file but I could not figure out which one.

Any idea ?

Here is my config file:

assembly:
  accession: SHADDVT3
  file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/SHADDVT3_asm_bp_hap1_p_ctg.fasta.gz
  level: contig
  prefix: SHADDVT3_asm_bp_hap1_p_ctg
  span: 459760080
busco:
  download_dir: /share/apps/genetic-databases/busco 
  basal_lineages:
   - eukaryota_odb10
  lineages:
   - metazoa_odb10
   - eukaryota_odb10
reads:
  single:
   - file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/XBTUA20231220R84050PL40740011D01bc2031bc2031_hifireads.fastq.gz
     prefix: XBTUA20231220R84050PL40740011D01bc2031bc2031
     platform: PACBIO_SMRT
settings:
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_min_length: 1000
  blast_overlap: 0
  stats_chunk: 1000
  stats_windows:
  - 0.1
  - 0.01
  - 100000
  - 1000000
  taxdump: /grps2/bmtitus/databases/taxdump
  tmp: /tmp
similarity:
  blastn:
    name: nt
    path: /share/apps/genetic-databases/nt
  defaults:
    evalue: 1.0e-10
    import_evalue: 1.0e-25
    max_target_seqs: 10
    taxrule: bestsumorder
  diamond_blastp:
    import_max_target_seqs: 100000
    name: reference_proteomes
    path: /share/apps/genetic-databases/uniprot
    taxrule: blastp=buscogenes
  diamond_blastx:
    name: reference_proteomes
    path: /share/apps/genetic-databases/uniprot
taxon:
  name: Stichodactyla haddoni

version: 2
aureliendejode commented 1 month ago

Hello,

Does anyone have an idea about this last issue ?

seq_id, start, end = re.split(r"[:-]", parts[0]) ValueError: too many values to unpack (expected 3)

rjchallis commented 1 month ago

This step parses the BUSCO locations from the diamond blast results file so it expects the sequence IDs to look like

seq1:start-end=busco_id=status

From the error, I guess you may have : or - characters in your seq IDs. Unfortunately the import won't work if this is the case, but you could remove the :start-end=busco_id=status and import it using --taxrule buscogenes (removing the blastp=) or skip this taxrule and import only the diamond blastx results.

aureliendejode commented 1 month ago

My diamond out put looks like this: head SHADDVT3_asm_bp_hap1_p_ctg.diamond.reference_proteomes.out h1tg000621l_path 317549 790 h1tg000621l_path tr|A0A7D9DP43|A0A7D9DP43_PARCT 38.1 1138 686 13 7388 3981 480 1601 3.92e-240 790 h1tg000621l_path 317549 648 h1tg000621l_path tr|A0A7D9DC04|A0A7D9DC04_PARCT 35.4 1085 668 14 7364 4140 228 1289 7.77e-195 648 h1tg000621l_path 317549 638 h1tg000621l_path tr|A0A6S7HVJ1|A0A6S7HVJ1_PARCT 35.5 1179 684 27 7388 3993 2 1151 2.10e-193 638 h1tg000621l_path 317549 633 h1tg000621l_path tr|A0A6S7I3Z1|A0A6S7I3Z1_PARCT 34.3 1126 650 18 7469 4140 4 1055 1.69e-192 633

My diamond_blastp results look like this: head SHADDVT3_asm_bp_hap1_p_ctg.diamond.busco_genes.out 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single 6105 724 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single tr|A0A6P8II84|A0A6P8II84_ACTTE 85.9 453 55 1 1 444 15 467 1.93e-259 724 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single 2652724 696 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single tr|A0A913XFY4|A0A913XFY4_EXADI 84.0 443 61 2 1 433 15 457 8.55e-249 696 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single 45351 615 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single tr|A7RYD1|A7RYD1_NEMVE 80.3 401 70 1 1 392 10 410 8.34e-219 615

and my blastn results look like this: h1tg000625l_path 6106 281 h1tg000625l_path MN307079.1 83.121 314 48 3 6387 6699 7210 6901 5.34e-69 h1tg000625l_path 6105 239 h1tg000625l_path XM_031699469.1 80.000 365 40 15 6387 6738 2210 1866 3.26e-56 h1tg000625l_path 6105 239 h1tg000625l_path XM_031699463.1 80.000 365 40 14 6387 6738 3262 3606 3.26e-56 h1tg000625l_path 1789172 228 h1tg000625l_path OU974084.1 78.512 363 67 6 6386 6741 1087799 1088157 7.05e-53 h1tg000625l_path 1789172 209 h1tg000625l_path OU974083.1 79.048 315 59 6 6385 6698 10137538 10137230 2.55e-47

So I guess the issue comes from the diamon_blastp output file. I'll try to skip the blastp then.

I am wondering is there a way to not have this issue without having to manually edit the files ?

rjchallis commented 1 month ago

I think I can update the code to split only on the suffix that the pipeline adds so seq IDs with these characters can still be imported - should be able to get a release out with this fixed by the end of next week.

rjchallis commented 1 month ago

Looks like I need to trace this back a bit further, could you share the first few lines from the eukaryota busco full_table.tsv file and one of the headers from a eukaryota_odb10/busco_sequences/single_copy_busco_sequences/*.faa file as it looks like the extra characters are being added by BUSCO and included in the headers when I extract the sequences before the diamond blastp step

aureliendejode commented 1 month ago

Hello, Just got back from a break. Here are the first lines of full_table.tsv:

# BUSCO version is: 5.6.1 
# The lineage dataset is: eukaryota_odb10 (Creation date: 2020-09-10, number of genomes: 70, number of BUSCOs: 255)
# Busco id  Status  Sequence    Gene Start  Gene End    Strand  Score   Length
39650at2759 Complete    h1tg000030l_path    553128  519812  -   2052.7  1448
83779at2759 Complete    h1tg000019l_path    3782250 3788189 +   217.3   233
87842at2759 Complete    h1tg000007l_path    2573119 2588445 +   596.6   643
97116at2759 Complete    h1tg000015l_path    532409  580117  +   1220.7  954
100698at2759    Complete    h1tg000043l_path    1811157 1829950 +   626.3   483
142542at2759    Complete    h1tg000017l_path    10922379    10941100    +   688.8   624
156083at2759    Complete    h1tg000071l_path    5097953 5077322 -   3765.1  1937

Here is a header of the eukaryota_odb10/busco_sequences/single_copy_busco_sequences/: (I had to decompress the busco_sequences.tar.gz to access those)

>956854at2759_1720309_0:004dfa|h1tg000065l_path:574692-575582
MADEVVDIEVARQKMRQWRDEPLRNSEEVVEFGECLLLEHGNKLGDELWTIYEQVFLASL
DCGKLDLATMCLKELNTQFPGSIRVKKLKGMRLEALERYEDAERIYDKILEAEPANSVAY
KRKIAILKSQNKIPEAIKELNEYLKKFMSDQEAWMELIDLYISLQDLRKAKFCMEELILA
NPHNHLYHQRFAEILYTMGDLESMEKARKYFAQSLKLDNDNLRSLYGFFLSASHLSSFAK
DSKAKRENAKYAGWAGAQIMERYQSLKAGDGQEEDNMLALGGMIDSLQLNTPQNIPS
rjchallis commented 3 weeks ago

Thanks for sharing the files, The latest version (4.3.11) should be able to handle these BUSCO sequence headers properly

aureliendejode commented 3 weeks ago

great thanks !