basenji_data_align.py train/test/validation splits

maroll commented 1 year ago

Hi,

I am trying to use the basenji_data_align.py script to produce a dataset similar to the mouse and human cross-species set used in your 2020 PLOS Comp Bio paper, but it is giving slightly unexpected results. The resulting sequences.bed files in the output have entire chromosomes marked as either train or test/validate, while I was expecting individual chromosome sequences to be split into train, test or validate segments.

I have been comparing my results from the basenji_data_align.py script to the files reported by Enformer and HyenaDNA as being re-used from Basenji (gs://basenji_barnyard/data/human/sequences.bed) which have a within-chromosome split. I appreciate that my approach has some differences to Basenji's original method by not excluding gap files or unmappable regions and using whole genome files including non-chromosomal regions, but I would still expect the script to split train/test/validate within a chromosome and not between them.

Here is an example of a command I have tried using:

python basenji_data_align.py \
    -u empty.txt,empty.txt \
    -g empty.txt,empty.txt \
    hg38.mm10.net hg38.fa,mm10.fa

The empty text files are a stand-in for gap regions and unmappable regions (which I am not interested in excluding), as the script died without these being defined. I have downloaded the genome .fa and alignment .net files from UCSC using the following links:

Human genome: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Mouse genome: https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
Alignment files (tried both):
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/vsMm10/hg38.mm10.syn.net.gz
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/vsMm10/hg38.mm10.net.gz

Could you please advise how I can adapt my approach to get train/test/valid splits within chromosome instead of between them?

Many thanks,

Masha

davek44 commented 1 year ago

Hi Masha,

You have to use the --break option to fragment the chromosomes; otherwise, it will only break them at assembly gaps and the bipartite orthology graph will be too connected. It also helps to use stricter requirements for defining orthologous regions. We have a new paper coming out where I used 524,288 bp sequences, and I used the following options: --break 2097152 -c 163840 --nf 524288 --no 393216 -l 524288 --stride 49173 -f 8 --umap_t 0.5 -w 32
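As a rough sketch of how these options might be combined with the command from the original post, the snippet below only assembles and prints the command line; it does not run the script. The file names (empty.txt, hg38.mm10.net, hg38.fa, mm10.fa) are taken from the question, the option values are the ones quoted above for 524,288 bp sequences, and whether they suit a different sequence length is an assumption to check.

```python
import shlex

# Option string quoted in the reply above (values chosen for 524,288 bp sequences).
reply_options = (
    "--break 2097152 -c 163840 --nf 524288 --no 393216 "
    "-l 524288 --stride 49173 -f 8 --umap_t 0.5 -w 32"
)

# Stand-in gap/unmappable files and inputs from the original post.
cmd = [
    "python", "basenji_data_align.py",
    "-u", "empty.txt,empty.txt",
    "-g", "empty.txt,empty.txt",
    *shlex.split(reply_options),
    "hg38.mm10.net", "hg38.fa,mm10.fa",
]

# Print a shell-ready command line for inspection; nothing is executed here.
print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; the printed line can be pasted into a shell once the values have been adjusted.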

This generates 524 kb sequences, where the target data covers the center 196608 bp (= 524288 - 2*163840). The sequences are strided by 49173 (= 196608/4 + 21), so each shift is ~1/4 of that central target span. You'll want to tweak all of these to fit your needs.
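The arithmetic above can be checked with a few lines of Python. The variable names are illustrative only and are not taken from the Basenji code; the numbers are the ones quoted in this reply.

```python
# Values quoted in the reply (-l, -c and --stride respectively).
seq_length = 524288
crop = 163840
stride = 49173

# Central region that remains after cropping both ends of the sequence.
target_span = seq_length - 2 * crop

# The stride is a quarter of the target span plus a small offset.
quarter_span = target_span // 4
offset = stride - quarter_span

print("target span:", target_span)
print("quarter of target span:", quarter_span)
print("stride offset:", offset)
print("stride / target span:", round(stride / target_span, 4))
```

To adapt this to another sequence length, recomputing the target span first and then picking a stride relative to it follows the same pattern.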

From there, you want to use a --break, --nf, and --no that are as small as possible, while still being able to create evenly sized folds. I now always aim for 6-8 folds, rather than a single train/valid/test. If you switch back to that, it's generally a little easier.

maroll commented 1 year ago

Hi Davek,

Thank you very much for the reply and the extra tips! Adding --break gives exactly the behaviour I was looking for.

Thanks again!

calico / basenji

basenji_data_align.py train/test/validation splits #177