train/test/valid sequences for predetermined regions

icdh99 commented 1 year ago

Hi!

Thank you for providing the code to generate a TFRecord dataset for Basenji. I would like to make TFRecords from an epigenetic track using the same train/test/validation sequences that were used for the Basenji/Enformer model (I retrieved them from gs://basenji_barnyard/data/human/sequences.bed).

As far as I understand, the script in preprocess.py randomly divides the genome (or a subset, if specified with the -s option) into train, test, and validation sets. Would it be possible to use the preset sets from the sequences.bed file above instead, or is this already possible and I overlooked it?

Thank you in advance!

icdh99 commented 1 year ago

I managed to make it work by using the --restart option and placing the desired .bed file in the output folder before running the script!

davek44 commented 1 year ago

Yes, I do this sometimes via the method you described.
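
For anyone trying the same workaround, here is a minimal sketch of the staging step only: it checks a downloaded BED file and copies it into the output folder before the script is run with --restart. It rests on assumptions not confirmed in this thread: that the file is tab-separated with the split label (train/valid/test) in the fourth column, and that the script looks for a file named sequences.bed in the output folder. The function name `stage_sequences_bed` and its parameters are hypothetical.

```python
import shutil
from collections import Counter
from pathlib import Path


def stage_sequences_bed(src_bed, out_dir, target_name="sequences.bed"):
    """Validate a BED file of predetermined regions and copy it into out_dir.

    Assumptions (not confirmed in this thread): the BED file is tab-separated
    with columns chrom, start, end, split label, and the data script expects
    the file under the name given by target_name.
    Returns the destination path and a count of sequences per split label.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    counts = Counter()
    with open(src_bed) as bed_in:
        for line_num, line in enumerate(bed_in, start=1):
            if not line.strip():
                continue  # skip blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                raise ValueError(
                    f"line {line_num}: expected at least 4 columns, got {len(fields)}"
                )
            # Coordinates must be integers; int() raises ValueError otherwise.
            int(fields[1])
            int(fields[2])
            counts[fields[3]] += 1

    # Copy the file unchanged so the region order is preserved.
    dest = out_dir / target_name
    shutil.copyfile(src_bed, dest)
    return dest, counts
```

Calling `stage_sequences_bed("sequences.bed", "my_output_dir")` returns the destination path and the per-split counts, which is a quick way to confirm that the labels match what you expect before running the script with --restart on that output folder.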

calico / basenji

train/test/valid sequences for predetermined regions #151