Open aureliendejode opened 7 months ago
Hi, It does look like a similar issue. That one should have been fixed in this commit that parsed the sequence IDs differently. Could you share the fasta header for the config that is in the BUSCO results so I can take a look at what needs to be done to get them to match this time
Hi, Here are the headers for the fasta file (the assembly), they all have the following format:
h1tg000309l_path h1tg000310l_path h1tg000311l_path h1tg000312l_path h1tg000313l_path h1tg000314l_path
They are contigs output from hifiasm assembly.
Then in the busco metazoa and eukaryota the contigs id are the same as in the fasta file:
270107at2759 Complete h1tg000039l_path 1923850 1928544 + 617.0 599 271586at2759 Complete h1tg000019l_path 3448379 3458776 + 297.6 389 290630at2759 Complete h1tg000004l_path 2354527 2344837 - 543.9 461 293315at2759 Complete h1tg000021l_path 1983882 1947754 - 509.3 559 296129at2759 Complete h1tg000005l_path 2926981 2938826 + 954.7 708 299766at2759 Complete h1tg000020l_path 9862506 9853025 - 495.8 436 306227at2759 Complete h1tg000025l_path 9236278 9247485 + 542.8 459
For the archaea and bacteria the "_path" has been removed: 26038at2 Fragmented h1tg000560l 68481 70196 + 151.4 265 https://www.orthodb.org/v10?query=26038at2 phosphoribosylformylglycinamidine synthase 91428at2 Missing 95696at2 Missing 143460at2 Missing 182107at2 Missing 219876at2 Missing 223233at2 Fragmented h1tg000118l 1228298 1230874 + 128.3 383 https://www.orthodb.org/v10?query=223233at2 DNA topoisomerase I
Hope that answer your question, let me know if you need more info.
Cheers Aurélien
Hi, Have you found a fix for this ? Best Aurélien
Looking at this more closely, it not straightforward to fix this in the BlobToolKit import code - the previous issue was with extra characters being added to the sequence IDs, but here they are removed so it is not as easy to put them back.
I think the best way would be a manual workaround either changing the sequence IDs in the assembly or editing the BUSCO file before import. To get the pipeline to run happily, you could also skip the bacteria and archaea lineages from the set of BUSCOs to run as these are usually included for use when exploring the results rather than required for the pipeline to run.
Thanks for this. I'll just skip the bacteria and archaea lineages for now. Best Aurélien
Hi, I removed the bacteria and archaea lineages but then I ran into an error at the diamond_blastp step. I think it is because I used "taxrule: blastp=buscogenes" for the tax rule. How can I solve this issue. Should change the taxrule for the diamond_blastp step ? Here is my config file:
assembly:
accession: SHADDVT3
file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/SHADDVT3_asm_bp_hap1_p_ctg.fasta.gz
level: contig
prefix: SHADDVT3_asm_bp_hap1_p_ctg
span: 459760080
busco:
download_dir: /share/apps/genetic-databases/busco
basal_lineages:
- eukaryota_odb10
- bacteria_odb10
- archaea_odb10
lineages:
- metazoa_odb10
- eukaryota_odb10
reads:
single:
- file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/XBTUA20231220R84050PL40740011D01bc2031bc2031_hifireads.fastq.gz
prefix: XBTUA20231220R84050PL40740011D01bc2031bc2031
platform: PACBIO_SMRT
settings:
blast_chunk: 100000
blast_max_chunks: 10
blast_min_length: 1000
blast_overlap: 0
stats_chunk: 1000
stats_windows:
- 0.1
- 0.01
- 100000
- 1000000
taxdump: /share/apps/genetic-databases/taxdump
tmp: /tmp
similarity:
blastn:
name: nt
path: /share/apps/genetic-databases/nt
defaults:
evalue: 1.0e-10
import_evalue: 1.0e-25
max_target_seqs: 10
taxrule: bestsumorder
diamond_blastp:
import_max_target_seqs: 100000
name: reference_proteomes
path: /share/apps/genetic-databases/uniprot
taxrule: blastp=buscogenes
diamond_blast:
name: reference_proteomes
path: /share/apps/genetic-databases/uniprot
taxon:
name: Stichodactyla haddoni
version: 2
I think you may just need to remove bacteria and archaea from the basal_lineages
section of the config as well
Yes, it worked but I ran into another issue at the run_blobtools_create:
Reading all TSV files in ../window_stats
Loading parsed taxdump
Traceback (most recent call last):
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/bin/blobtools", line 8, in
I have redownloaeded the newtaxdump to sure that this was not the issue. Here is what I have in the taxdump folder:
-rw-r--r-- 1 adejode bmtitus 4,9K 14 mai 08:31 gencode.dmp -rw-r--r-- 1 adejode bmtitus 452 14 mai 08:31 division.dmp -rw-r--r-- 1 adejode bmtitus 3,0K 14 mai 08:37 typeoftype.dmp -rw-r--r-- 1 adejode bmtitus 292M 14 mai 08:37 taxidlineage.dmp -rw-r--r-- 1 adejode bmtitus 321M 14 mai 08:37 rankedlineage.dmp -rw-r--r-- 1 adejode bmtitus 1,4M 14 mai 08:37 merged.dmp -rw-r--r-- 1 adejode bmtitus 3,8M 14 mai 08:37 images.dmp -rw-r--r-- 1 adejode bmtitus 691M 14 mai 08:37 fullnamelineage.dmp -rw-r--r-- 1 adejode bmtitus 4,5M 14 mai 08:37 delnodes.dmp -rw-r--r-- 1 adejode bmtitus 232M 14 mai 08:39 nodes.dmp -rw-r--r-- 1 adejode bmtitus 217M 14 mai 08:40 names.dmp -rw-r--r-- 1 adejode bmtitus 19M 14 mai 08:40 citations.dmp -rw-r--r-- 1 adejode bmtitus 26M 14 mai 08:40 typematerial.dmp -rw-r--r-- 1 adejode bmtitus 5,6M 14 mai 08:40 host.dmp -rw-r--r-- 1 adejode bmtitus 44K 14 mai 08:40 excludedfromtype.dmp -rw-r--r-- 1 adejode bmtitus 130M 14 mai 09:20 new_taxdump.tar.gz -rw-r--r-- 1 adejode bmtitus 53 14 mai 09:20 new_taxdump.tar.gz.md5 -rw-r--r-- 1 adejode bmtitus 614M 14 mai 09:25 taxdump.json
It looks like some parsing issue for one of the file but I could not figure out which one.
Any idea ?
Here is my config file:
assembly:
accession: SHADDVT3
file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/SHADDVT3_asm_bp_hap1_p_ctg.fasta.gz
level: contig
prefix: SHADDVT3_asm_bp_hap1_p_ctg
span: 459760080
busco:
download_dir: /share/apps/genetic-databases/busco
basal_lineages:
- eukaryota_odb10
lineages:
- metazoa_odb10
- eukaryota_odb10
reads:
single:
- file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/XBTUA20231220R84050PL40740011D01bc2031bc2031_hifireads.fastq.gz
prefix: XBTUA20231220R84050PL40740011D01bc2031bc2031
platform: PACBIO_SMRT
settings:
blast_chunk: 100000
blast_max_chunks: 10
blast_min_length: 1000
blast_overlap: 0
stats_chunk: 1000
stats_windows:
- 0.1
- 0.01
- 100000
- 1000000
taxdump: /grps2/bmtitus/databases/taxdump
tmp: /tmp
similarity:
blastn:
name: nt
path: /share/apps/genetic-databases/nt
defaults:
evalue: 1.0e-10
import_evalue: 1.0e-25
max_target_seqs: 10
taxrule: bestsumorder
diamond_blastp:
import_max_target_seqs: 100000
name: reference_proteomes
path: /share/apps/genetic-databases/uniprot
taxrule: blastp=buscogenes
diamond_blastx:
name: reference_proteomes
path: /share/apps/genetic-databases/uniprot
taxon:
name: Stichodactyla haddoni
version: 2
Hello,
Does anyone have an idea about this last issue ?
seq_id, start, end = re.split(r"[:-]", parts[0]) ValueError: too many values to unpack (expected 3)
This step parses the BUSCO locations from the diamond blast results file so it expects the sequence IDs to look like
seq1:start-end=busco_id=status
From the error, I guess you may have :
or -
characters in your seq IDs. Unfortunately the import won't work if this is the case, but you could remove the :start-end=busco_id=status
and import it using --taxrule buscogenes
(removing the blastp=
) or skip this taxrule and import only the diamond blastx
results.
My diamond out put looks like this: head SHADDVT3_asm_bp_hap1_p_ctg.diamond.reference_proteomes.out h1tg000621l_path 317549 790 h1tg000621l_path tr|A0A7D9DP43|A0A7D9DP43_PARCT 38.1 1138 686 13 7388 3981 480 1601 3.92e-240 790 h1tg000621l_path 317549 648 h1tg000621l_path tr|A0A7D9DC04|A0A7D9DC04_PARCT 35.4 1085 668 14 7364 4140 228 1289 7.77e-195 648 h1tg000621l_path 317549 638 h1tg000621l_path tr|A0A6S7HVJ1|A0A6S7HVJ1_PARCT 35.5 1179 684 27 7388 3993 2 1151 2.10e-193 638 h1tg000621l_path 317549 633 h1tg000621l_path tr|A0A6S7I3Z1|A0A6S7I3Z1_PARCT 34.3 1126 650 18 7469 4140 4 1055 1.69e-192 633
My diamond_blastp results look like this: head SHADDVT3_asm_bp_hap1_p_ctg.diamond.busco_genes.out 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single 6105 724 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single tr|A0A6P8II84|A0A6P8II84_ACTTE 85.9 453 55 1 1 444 15 467 1.93e-259 724 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single 2652724 696 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single tr|A0A913XFY4|A0A913XFY4_EXADI 84.0 443 61 2 1 433 15 457 8.55e-249 696 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single 45351 615 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single tr|A7RYD1|A7RYD1_NEMVE 80.3 401 70 1 1 392 10 410 8.34e-219 615
and my blastn results look like this: h1tg000625l_path 6106 281 h1tg000625l_path MN307079.1 83.121 314 48 3 6387 6699 7210 6901 5.34e-69 h1tg000625l_path 6105 239 h1tg000625l_path XM_031699469.1 80.000 365 40 15 6387 6738 2210 1866 3.26e-56 h1tg000625l_path 6105 239 h1tg000625l_path XM_031699463.1 80.000 365 40 14 6387 6738 3262 3606 3.26e-56 h1tg000625l_path 1789172 228 h1tg000625l_path OU974084.1 78.512 363 67 6 6386 6741 1087799 1088157 7.05e-53 h1tg000625l_path 1789172 209 h1tg000625l_path OU974083.1 79.048 315 59 6 6385 6698 10137538 10137230 2.55e-47
So I guess the issue comes from the diamon_blastp output file. I'll try to skip the blastp then.
I am wondering is there a way to not have this issue without having to manually edit the files ?
I think I can update the code to split only on the suffix that the pipeline adds so seq IDs with these characters can still be imported - should be able to get a release out with this fixed by the end of next week.
Looks like I need to trace this back a bit further, could you share the first few lines from the eukaryota busco full_table.tsv
file and one of the headers from a eukaryota_odb10/busco_sequences/single_copy_busco_sequences/*.faa
file as it looks like the extra characters are being added by BUSCO and included in the headers when I extract the sequences before the diamond blastp step
Hello,
Just got back from a break.
Here are the first lines of full_table.tsv
:
# BUSCO version is: 5.6.1
# The lineage dataset is: eukaryota_odb10 (Creation date: 2020-09-10, number of genomes: 70, number of BUSCOs: 255)
# Busco id Status Sequence Gene Start Gene End Strand Score Length
39650at2759 Complete h1tg000030l_path 553128 519812 - 2052.7 1448
83779at2759 Complete h1tg000019l_path 3782250 3788189 + 217.3 233
87842at2759 Complete h1tg000007l_path 2573119 2588445 + 596.6 643
97116at2759 Complete h1tg000015l_path 532409 580117 + 1220.7 954
100698at2759 Complete h1tg000043l_path 1811157 1829950 + 626.3 483
142542at2759 Complete h1tg000017l_path 10922379 10941100 + 688.8 624
156083at2759 Complete h1tg000071l_path 5097953 5077322 - 3765.1 1937
Here is a header of the eukaryota_odb10/busco_sequences/single_copy_busco_sequences/
:
(I had to decompress the busco_sequences.tar.gz
to access those)
>956854at2759_1720309_0:004dfa|h1tg000065l_path:574692-575582
MADEVVDIEVARQKMRQWRDEPLRNSEEVVEFGECLLLEHGNKLGDELWTIYEQVFLASL
DCGKLDLATMCLKELNTQFPGSIRVKKLKGMRLEALERYEDAERIYDKILEAEPANSVAY
KRKIAILKSQNKIPEAIKELNEYLKKFMSDQEAWMELIDLYISLQDLRKAKFCMEELILA
NPHNHLYHQRFAEILYTMGDLESMEKARKYFAQSLKLDNDNLRSLYGFFLSASHLSSFAK
DSKAKRENAKYAGWAGAQIMERYQSLKAGDGQEEDNMLALGGMIDSLQLNTPQNIPS
Thanks for sharing the files, The latest version (4.3.11) should be able to handle these BUSCO sequence headers properly
great thanks !
Hello,
I have encountered the followinf error: more SHADDVT3_asm_bp_hap1_p_ctg/run_blobtools_create.log
Reading all TSV files in ../window_stats Traceback (most recent call last): File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/bin/blobtools", line 8, in
sys.exit(cli())
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/blobtools.py", line 105, in cli
sys.exit(subcommand())
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/add.py", line 203, in cli
main(args)
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/add.py", line 149, in main
parsed = field["module"].parse(
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/busco.py", line 82, in parse
busco = parse_busco(file, identifiers=kwargs["dependencies"]["identifiers"])
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/busco.py", line 63, in parse_busco
raise UserWarning(
UserWarning: Contig names in the Busco file did not match dataset identifiers.
I saw people having similar issue (e.g. #2) in the past but I did not see the fix for it. In my case it looks like everything works fine for busco metazoa and eukaryota but for archaea and bacteria busco removed "_path" from the contigs id and I think that is what is causing the error.
zcat SHADDVT3_asm_bp_hap1_p_ctg.busco.bacteria_odb10/full_table.tsv.gz | head
BUSCO version is: 5.6.1 The lineage dataset is: bacteria_odb10 (Creation date: 2020-03-06, number of genomes: 4085, number of BUSCOs: 124) Busco id Status Sequence Gene Start Gene End Strand Score Length OrthoDB url Description 4421at2 Missing 9601at2 Missing 26038at2 Fragmented h1tg000560l 68481 70196 + 151.4 265 https://www.orthodb.org/v10?query=26038at2 phosphoribosylformylglycinamidine synthase 91428at2 Missing 95696at2 Missing 143460at2 Missing 182107at2 Missing
Is there a fix to this or is it something that has to be done manually ?
Thanks for your help