Ploidy error - Githubissues

alistairhockey commented 1 year ago

Hi there,

I am currently triyng to detect SVs between chickpea (ref) and a close relative (var). Both are diploid species and are known to have the same no. of chromosomes. However, once I reach the "working on diploid" stage of parameter optimisation, I get this error:

`[20/12/2022, 00:33:39] WARNING: There are 8 available threads, and you required 16. [20/12/2022, 00:33:39] Running perSVade align_reads into SRR6242439.output with 25.469 Gb of RAM and 16 cores [20/12/2022, 00:34:09] Aligning reads [20/12/2022, 01:10:52] Calculating coverage per window [20/12/2022, 01:11:26] perSVade align_reads finished correctly Running module align_reads... [20/12/2022, 01:12:28] WARNING: There are 8 available threads, and you required 16. [20/12/2022, 01:12:28] Running perSVade optimize_parameters into SRR6242439.parameter_optimization with 25.459 Gb of RAM and 16 cores [20/12/2022, 01:13:10] Running parameter optimization [20/12/2022, 01:13:20] The median insert size is 179, with an absolute deviation of 58 [20/12/2022, 01:14:02] Simulating reference genome reads [20/12/2022, 14:05:12] working on simulation 1 [21/12/2022, 05:50:36] working on diploid Traceback (most recent call last): File "/perSVade/scripts/sv_functions.py", line 8289, in get_fractions_reads_for_ploidy typeGenome_toncopies = {x.split(":")[0] : int(x.split(":")[1]) for x in ploidy.split("")} File "/perSVade/scripts/sv_functions.py", line 8289, in typeGenome_toncopies = {x.split(":")[0] : int(x.split(":")[1]) for x in ploidy.split("")} IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/perSVade/scripts/optimize_parameters", line 172, in gridss_blacklisted_regions, gridss_maxcoverage, gridss_filters_dict, max_rel_coverage_to_consider_del, min_rel_coverage_to_consider_dup, df_cross_benchmark_best = fun.get_best_parameters_for_GridssClove_run(sorted_bam, opt.ref, parameter_optimisation_dir, threads=opt.threads, replace=opt.replace, n_simulated_genomes=opt.nsimulations, mitochondrial_chromosome=opt.mitochondrial_chromosome, simulation_ploidies=simulation_ploidies, range_filtering_benchmark=opt.range_filtering_benchmark, nvars=opt.nvars, real_bedpe_breakpoints=real_bedpe_breakpoints, median_insert_size=median_insert_size, median_insert_size_sd=median_insert_size_sd, tmpdir=opt.tmpdir, simulation_chromosomes=opt.simulation_chromosomes) File "/perSVade/scripts/sv_functions.py", line 10167, in get_best_parameters_for_GridssClove_run ploidy_merged_bam = get_merged_bamfile_for_ploidy(variant_bamfile=simulation_bam_file, reference_bamfile=simulated_reference_bam_file, ploidy=ploidy, replace=replace, threads=threads) File "/perSVade/scripts/sv_functions.py", line 8387, in get_merged_bamfile_for_ploidy fraction_var, fraction_ref = get_fractions_reads_for_ploidy(ploidy) File "/perSVade/scripts/sv_functions.py", line 8292, in get_fractions_reads_for_ploidy except: raise ValueError("ploidy %s is incorrect. The format is ref:2_var:1 , which would be one copy of variant genome for every two copies of reference genome"%ploidy) ValueError: ploidy diploid is incorrect. The format is ref:2_var:1 , which would be one copy of variant genome for every two copies of reference genome Running module optimize_parameters...

ERROR!!! Out status: 256 ERROR conda.cli.main_run:execute(34): Subprocess for 'conda run ['/bin/bash', '-c', 'eval export PATH="$PATH:/data/alistairh/projects/SV_calling/bin"; /bin/bash -ue /data/alistairh/projects/work/3b/de74459b67e6b581f0b75608d3a783/.command.sh']' command failed. (See above for error)`

I have also tried this on ploidy: auto, but I have received the same error. Could it be due to the file types? My assumption is that the genome reference fasta may be phased, but the fastq reads of the var are of course not. Any help would be welcome.

Kind regards, Alistair

MikiSchikora commented 1 year ago

Hi,

I have seen such confusion with the --simulation_ploidies argument across users, and I am working on a better documentation for the next release.

According to the log, I assume that you are running the 'optimize_parameters' module with '--simulation_ploidies diploid', which is the source of the error. This argument is the way you tell perSVade to simulate reads from different genome ploidies/zygosities. In diploid organisms it may be either '--simulation_ploidies diploid_homo', '--simulation_ploidies diploid_hetero' or '--simulation_ploidies diploid_homo,diploid_hetero'. So '--simulation_ploidies diploid' cannot be used. This triggers the error. These are valid options:

If you specify '--simulation_ploidies diploid_hetero', perSVade will simulate two diploid genomes where one set of chromosomes are like the reference and the other set of genomes have SVs. Like this all the simlated SVs are heterozygous, and the parameters will be optimized for these heterozygous SVs.
If you specify '--simulation_ploidies diploid_homo', perSVade will simulate two diploid genomes where both sets of chromosomes have SVs. Like this all the simlated SVs are homozygous, and the parameters will be optimized for these homozygous SVs.
If you specify '--simulation_ploidies diploid_homo,diploid_hetero', perSVade will simulate four diploid genomes. Two genomes with all heterozygous and two genomes with all homozygous SVs, and the parameters will be optimized for both homo/heterozygous SVs. Note that this option may increase the computational burden.

In the paper we used '--simulation_ploidies diploid_hetero' for diploids because its the way to put the hardest variants with the minimum number of simulations. This is what I'd recommend unless you are only inetrested in homozygous SVs. You can check our paper and/or the FAQ 'How does the parameter optimization for SV calling work?' for further clarifications.

Btw, we have not tested perSVade on input phased genomes, and I am not sure how well it'd work if the divergence between parental subgenomes is low. For example, there may be a lot of multimapping between subgenomes and we cannot guarantee that the SV calling procedure will work well. The way we used perSVade on diploids was taking one haplotype as reference genome.

I hope this helps :)

Miquel Àngel Schikora Barcelona Supercomputing Center

MikiSchikora commented 1 year ago

Hi, Is this solved? I'd close the issue if so.

Miquel Àngel Schikora Barcelona Supercomputing Center

MikiSchikora commented 1 year ago

Hi, Is this solved? I'd close the issue if so.

Miquel Àngel Schikora Barcelona Supercomputing Center

MikiSchikora commented 1 year ago

Hi, Is this solved? I'd close the issue if so.

Miquel Àngel Schikora Barcelona Supercomputing Center

Gabaldonlab / perSVade

Ploidy error #12