ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Cactus-pangenome fails at make_vcf step only on real data #1416

Closed (amsession closed this issue 1 week ago)

amsession commented 2 weeks ago

I am trying to run cactus-pangenome and was able to run it successfully on the example data set. However, when I use a single chromosome of real data with just 2 species, it fails at the "make_vcf" step. I am unsure how to interpret the log file beyond that.

The log file is attached. The exact command used was "apptainer exec ~/LOCAL.INSTALL/cactus/cactus_v2.8.3.sif cactus-pangenome ./js ./XlaXpe.Chr1L.txt --outDir Chr1L --outName Chr1L --reference Xla --vcf --giraffe --gfa --gbz --maxCores 32 --restart". This is the latest log file, produced after restarting with a higher --maxCores value.

error_log5.txt

glennhickey commented 1 week ago

Hi. This looks very similar to the issue in #1402: vg deconstruct appears to be writing a line with no sample information:

[E::bcf_write] Broken VCF record, the number of columns at Chr1L:30057051 does not match the number of samples (0 vs 1)

Are you able to share the input data with me so I can try to reproduce? Failing that, if you could share the contents of ./XlaXpe.Chr1L.txt, that may help a bit. Thanks.
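
For anyone hitting the same bcf_write error: it says a data line carries fewer sample columns than the header declares. If a plain-text VCF from the failed step can be recovered, a check along these lines could locate the offending records. This is only a sketch; `check_vcf_columns` is a hypothetical helper name, not part of cactus, vg or bcftools, and it assumes tab-separated, uncompressed input (a file path, or stdin when no argument is given).

```shell
# Report VCF data lines whose column count differs from the #CHROM header line.
# check_vcf_columns is a hypothetical helper, not a cactus/vg/bcftools command.
# Usage: check_vcf_columns file.vcf    or    zcat file.vcf.gz | check_vcf_columns
check_vcf_columns() {
    awk -F'\t' '
        /^##/     { next }                   # skip meta-information lines
        /^#CHROM/ { expected = NF; next }    # header line sets the expected column count
        NF != expected {
            printf "%s:%s has %d columns, expected %d\n", $1, $2, NF, expected
            bad = 1
        }
        END { exit bad + 0 }                 # non-zero exit status if any line is malformed
    ' "${1:--}"
}
```

With one sample, the header has 10 columns (9 fixed plus the sample), so a record with no sample information would be reported as having 9.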

amsession commented 1 week ago

Unfortunately, both FASTA files are too large to share here even after compression (25MB limit). I am attempting to align Chr1L sequences between the Xenopus laevis v10 genome ( https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_017654675.1/ ) and the Xenopus petersii paternal assembly ( https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_038501925.1/ ). The chromosome is named "Chr1L" in X. laevis and "1L" in X. petersii. There are massive misassemblies in the maternal assembly, so it should not be used. If there is an easier way to share the FASTA files directly, please let me know. The .txt file is attached.

XlaXpe.Chr1L.txt

glennhickey commented 1 week ago

Thanks!! I was able to reproduce it. Will fix asap. For the record, these are the commands I used (with v2.8.3):

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/017/654/675/GCF_017654675.1_Xenopus_laevis_v10.1/GCF_017654675.1_Xenopus_laevis_v10.1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/038/501/925/GCA_038501925.1_aXenPet1.paternal.cur/GCA_038501925.1_aXenPet1.paternal.cur_genomic.fna.gz

gzip -d GCF_017654675.1_Xenopus_laevis_v10.1_genomic.fna.gz
gzip -d GCA_038501925.1_aXenPet1.paternal.cur_genomic.fna.gz

mkdir -p ./XlaChr
mkdir -p ./XpeChr

samtools faidx GCF_017654675.1_Xenopus_laevis_v10.1_genomic.fna NC_054371.1 >  ./XlaChr/Chr1L.fa
samtools faidx GCA_038501925.1_aXenPet1.paternal.cur_genomic.fna CM076672.1 >  ./XpeChr/1L.fa

printf "Xla ./XlaChr/Chr1L.fa\n" > XlaXpe.Chr1L.txt
printf "Xpe ./XpeChr/1L.fa\n" >> XlaXpe.Chr1L.txt

TOIL_SLURM_ARGS="--partition=long --time=8000" cactus-pangenome ./js ./XlaXpe.Chr1L.txt --outDir Chr1L --outName Chr1L --reference Xla --vcf --giraffe --gfa --gbz --consCores 32 --batchSystem slurm --logFile Chr1L.log --indexCores 32 --mgCores 32
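
When repeating the commands above, it may be worth confirming that the extraction step actually produced usable FASTA files before launching the long cactus-pangenome run. The sketch below reads a seqfile of "name path" lines, as written by the printf commands above, and checks that each path is a non-empty file starting with ">". `check_seqfile` is a hypothetical helper name, and it assumes local, uncompressed paths.

```shell
# Sanity-check a cactus-pangenome seqfile of "name path" lines.
# check_seqfile is a hypothetical helper; it assumes local, uncompressed FASTA paths.
# Usage: check_seqfile XlaXpe.Chr1L.txt
check_seqfile() {
    seqfile_status=0
    while read -r name path; do
        # Skip blank lines
        [ -z "$name" ] && continue
        # A usable FASTA must exist, be non-empty, and begin with a ">" header
        if [ -s "$path" ] && [ "$(head -c 1 "$path")" = ">" ]; then
            echo "$name ok"
        else
            echo "$name missing, empty, or not FASTA: $path"
            seqfile_status=1
        fi
    done < "$1"
    return "$seqfile_status"
}
```

The function returns a non-zero status if any entry fails, so it can gate the alignment command with `check_seqfile XlaXpe.Chr1L.txt && cactus-pangenome ...`.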