Closed ndussex closed 1 year ago
Hi Nic! The code that splits up the genome assembly fasta into chunks is currently not written to provide any options like that. I don't see any good solution to this problem with the current pipeline implementation, sorry!
Hi Verena,
I see. I don't think an option in the config file would be necessary.
I just wonder how the number of contigs per chunk is assigned. It doesn't seem random, right? Could it be hard-coded in the rule?
Nic
It is hard-coded in Python code; this was necessary because implementing that part in Snakemake didn't work. You can have a look here: https://github.com/NBISweden/GenErode/blob/4fc1faad59e0020f915d87e2a8c4e4ff25aa9d35/workflow/rules/common.smk#L394
The function split_ref_bed takes the genome bed file that lists all contigs and divides the number of contigs by 200 to get chunks with equal numbers of contigs (this applies if there are more than 200 contigs; with fewer, it puts one contig in each chunk). The function writes one bed file per chunk, and these bed files, together with a Python list of the chunk bed files, are then used by the rules in the GERP step.
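To illustrate the behaviour described above, here is a rough sketch of count-based chunking. It is not the actual GenErode code: the function name `chunk_contigs_by_count`, the `max_chunks` parameter, and the rounding up of the chunk size are assumptions made for the example, so the exact chunk sizes may differ from what the pipeline produces.

```python
import math


def chunk_contigs_by_count(contigs, max_chunks=200):
    """Group contigs, in file order, into chunks with equal contig counts.

    Hypothetical helper for illustration only; the real logic lives in
    split_ref_bed in workflow/rules/common.smk.
    """
    # With few contigs, each contig gets its own chunk.
    if len(contigs) <= max_chunks:
        return [[contig] for contig in contigs]
    # Assumption: the chunk size is rounded up so that no more than
    # max_chunks chunks are created.
    per_chunk = math.ceil(len(contigs) / max_chunks)
    return [contigs[i:i + per_chunk] for i in range(0, len(contigs), per_chunk)]


# Toy assembly resembling the one in this issue: 35 chromosomes listed
# first, followed by 5000 unplaced scaffolds.
contigs = [f"chr{i}" for i in range(1, 36)]
contigs += [f"scaffold{i}" for i in range(1, 5001)]
chunks = chunk_contigs_by_count(contigs)
```

Because the contigs are grouped in file order and only counted, all 35 chromosomes in this toy example land in the first two chunks, which matches the runtime imbalance reported in this issue.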
If you know some Python, you could try changing the function that creates the chunk bed files so that it takes the length of the contigs (or chromosomes in your case) into account.
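One possible way to take contig length into account is a greedy assignment: sort the contigs from longest to shortest and always add the next one to the chunk that currently holds the fewest bases. The sketch below is only an illustration under stated assumptions; `chunk_contigs_by_length` is a hypothetical name, the records are assumed to be `(contig, start, end)` tuples parsed from the genome bed file, and wiring it into split_ref_bed (writing the chunk bed files and returning the chunk list) is left out.

```python
import heapq


def chunk_contigs_by_length(records, n_chunks):
    """Distribute bed records over chunks so total lengths are balanced.

    Hypothetical helper for illustration; records are (contig, start, end)
    tuples, and the length of a record is end - start.
    """
    n_chunks = min(n_chunks, len(records))
    chunks = [[] for _ in range(n_chunks)]
    # Min-heap of (total bases assigned so far, chunk index).
    heap = [(0, idx) for idx in range(n_chunks)]
    heapq.heapify(heap)
    # Longest contigs first, each going to the currently lightest chunk.
    for rec in sorted(records, key=lambda r: r[2] - r[1], reverse=True):
        total, idx = heapq.heappop(heap)
        chunks[idx].append(rec)
        heapq.heappush(heap, (total + rec[2] - rec[1], idx))
    return chunks


# Toy example: two long "chromosomes" and three short scaffolds.
records = [
    ("chr1", 0, 100),
    ("chr2", 0, 80),
    ("s1", 0, 10),
    ("s2", 0, 10),
    ("s3", 0, 5),
]
balanced = chunk_contigs_by_length(records, n_chunks=2)
totals = [sum(end - start for _, start, end in chunk) for chunk in balanced]
```

With this kind of assignment, each chromosome ends up in its own chunk as long as there are at least as many chunks as chromosomes, and the small scaffolds fill up the remaining chunks. Note that the contigs within a chunk are no longer in file order, so any downstream step that relies on that order would need checking.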
Oh I see...
So it should be easy to fix. I can simply divide by a larger number than 200, say 500 (I have ~5000 contigs), and it should work then. I don't need to take the length into account.
Thank you.
Hi,
I am estimating GERP scores for ~70 mammalian genomes that were mapped to a chromosome-level assembly (~2.5 Gb; 35 chromosomes and ~5000 unplaced scaffolds).
As I understand it, the pipeline divides the genomes and VCFs into chunks. In my case, each chunk comprises 27 contigs or so. However, since I have a chromosome-level assembly, chunk1 and chunk2 contain all 35 of my chromosomes, and the other chunks cover the remaining ~5000 much smaller unplaced scaffolds. This means that the jobs to estimate derived alleles for chunk1 and chunk2 take ~10 days, whereas the jobs for the other scaffolds are done within minutes.
Would there be a way to split the genome per scaffold/contig, or to specify that each chunk should cover X scaffolds/contigs?
Cheers, Nic