NBISweden / GenErode

GitHub repository for GenErode, a Snakemake pipeline for the analysis of whole-genome sequencing data from historical and modern samples to study patterns of genome erosion.

GERP scores - size of chunks for derived alleles calculations #46

Closed ndussex closed 1 year ago

ndussex commented 1 year ago

Hi,

I am estimating GERP scores for ~70 mammalian genomes that were mapped to a chromosome-level assembly (~2.5 Gb; 35 chromosomes and ~5000 unplaced scaffolds).

As I understand it, the pipeline divides the genome and VCFs into chunks. In my case, each chunk comprises ~27 contigs. However, since I have a chromosome-level assembly, chunks 1 and 2 contain all 35 of my chromosomes, and the remaining chunks cover the ~5000 much smaller unplaced scaffolds. This means the jobs to estimate derived alleles for chunks 1 and 2 take ~10 days, whereas the jobs for the other scaffolds finish within minutes.

Would there be a way to split the genome per scaffold/contig or to specify that each chunk should cover X number of scaffolds/contigs?

Cheers, Nic

verku commented 1 year ago

Hi Nic! The code that splits up the genome assembly fasta into chunks is currently not written to provide any options like that. I don't see any good solution to this problem with the current pipeline implementation, sorry!

ndussex commented 1 year ago

Hi Verena,

I see. I don't think an option in the config file would be necessary.

I just wonder how the number of contigs per chunk is assigned. It doesn't seem random, right? Is it hard-coded in the rule?

Nic

verku commented 1 year ago

It is hard-coded in Python; this was necessary because it didn't work to implement that part directly in Snakemake. You can have a look here: https://github.com/NBISweden/GenErode/blob/4fc1faad59e0020f915d87e2a8c4e4ff25aa9d35/workflow/rules/common.smk#L394

The function split_ref_bed takes the genome BED file that lists all contigs and divides the number of contigs by 200, so that each chunk contains an equal number of contigs (if there are fewer than 200 contigs, each contig goes into its own chunk). The function writes one BED file per chunk, and a Python list of the chunk BED files is then used by the rules in the GERP step.
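Roughly, the logic looks like this (a simplified sketch of what I described above, not the exact code from common.smk):

```python
import math
import os

def split_ref_bed_sketch(bed_path, n_chunks=200, outdir="chunks"):
    """Write one BED file per chunk and return the list of chunk file paths."""
    with open(bed_path) as f:
        contigs = [line.rstrip("\n") for line in f if line.strip()]
    # With more than n_chunks contigs, each chunk gets an equal share;
    # with fewer, every contig goes into its own chunk.
    if len(contigs) > n_chunks:
        chunk_size = math.ceil(len(contigs) / n_chunks)
    else:
        chunk_size = 1
    os.makedirs(outdir, exist_ok=True)
    chunk_files = []
    for i in range(0, len(contigs), chunk_size):
        out = os.path.join(outdir, f"chunk{i // chunk_size + 1}.bed")
        with open(out, "w") as fh:
            fh.write("\n".join(contigs[i:i + chunk_size]) + "\n")
        chunk_files.append(out)
    return chunk_files  # consumed as a Python list by the GERP rules
```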

If you know some Python, you could try to change the function that creates the chunk BED files so that it takes the length of the contigs (or chromosomes, in your case) into account, e.g. along the lines of the sketch below.
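For example, a greedy "longest first" assignment would balance the total contig length per chunk, so the chromosomes don't all cluster in the first chunks. This is an untested sketch reusing the same input/output conventions as above, not something the pipeline currently provides:

```python
import heapq
import os

def split_ref_bed_by_length(bed_path, n_chunks=200, outdir="chunks_by_length"):
    """Assign each contig (longest first) to the chunk with the smallest
    total length so far, so large chromosomes spread across chunks."""
    with open(bed_path) as f:
        rows = [line.split() for line in f if line.strip()]
    # BED columns: chrom, start, end; contig length = end - start
    rows.sort(key=lambda r: int(r[2]) - int(r[1]), reverse=True)
    heap = [(0, i, []) for i in range(n_chunks)]  # (total bp, chunk index, rows)
    heapq.heapify(heap)
    for row in rows:
        total, i, members = heapq.heappop(heap)  # lightest chunk so far
        members.append(row)
        heapq.heappush(heap, (total + int(row[2]) - int(row[1]), i, members))
    os.makedirs(outdir, exist_ok=True)
    chunk_files = []
    for _, i, members in sorted(heap, key=lambda t: t[1]):
        if not members:  # skip empty chunks if n_chunks > number of contigs
            continue
        out = os.path.join(outdir, f"chunk{i + 1}.bed")
        with open(out, "w") as fh:
            fh.writelines("\t".join(r) + "\n" for r in members)
        chunk_files.append(out)
    return chunk_files
```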

ndussex commented 1 year ago

Oh, I see...

So it should be easy to fix: I can simply divide by a larger number than 200, say 500 (I have ~5000 contigs), and it should work. I don't need to take the length into account.
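In terms of the sketch above (assuming the divisor behaves as described there; the BED file name is just a placeholder), that would simply be:

```python
# With ~5000 contigs, dividing by 500 instead of 200 gives chunks of
# ~10 contigs rather than ~25, so the 35 chromosomes spread over more
# chunks and no single chunk runs for ~10 days.
chunk_files = split_ref_bed_sketch("reference_contigs.bed", n_chunks=500)
```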

Thank you.