calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

train/test/valid sequences for predetermined regions #151

Closed icdh99 closed 1 year ago

icdh99 commented 1 year ago

Hi!

Thank you for providing the code to generate a TFRecord dataset for Basenji. I would like to make TFRecords from an epigenetic track using the same train/test/validation sequences that were used for the Basenji/Enformer models (I retrieved them from gs://basenji_barnyard/data/human/sequences.bed).

As far as I understand, the script in preprocess.py randomly divides the genome (or a subset, if specified with the -s option) into train, test and validation sets. Would it be possible to use the preset splits from the above sequences.bed file instead, or is this already possible and I overlooked it?

Thank you in advance!

icdh99 commented 1 year ago

I managed to make it work by placing the desired .bed file in the output folder and then running the script with the --restart option!
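For anyone trying the same thing, here is a rough sketch of the staging step in Python. It is not part of the Basenji code. It assumes the BED file has four whitespace-separated columns (chromosome, start, end, split label), that the labels are `train`, `valid` and `test`, and that the script looks for a file named `sequences.bed` in the output folder when restarting. The helper names `count_split_labels` and `stage_sequences_bed` are made up for illustration, so please check these assumptions against your own copy of the file and script.

```python
import shutil
from collections import Counter
from pathlib import Path

# Assumption: these are the split labels in the 4th BED column.
EXPECTED_LABELS = {"train", "valid", "test"}


def count_split_labels(bed_path):
    """Check each BED line and count how many sequences fall in each split."""
    counts = Counter()
    with open(bed_path) as bed_in:
        for line_num, line in enumerate(bed_in, start=1):
            if not line.strip():
                continue  # skip blank lines
            fields = line.split()
            if len(fields) < 4:
                raise ValueError(f"line {line_num}: expected 4 columns, got {len(fields)}")
            chrom, start, end, label = fields[:4]
            if int(start) >= int(end):
                raise ValueError(f"line {line_num}: start must be smaller than end")
            if label not in EXPECTED_LABELS:
                raise ValueError(f"line {line_num}: unexpected split label {label!r}")
            counts[label] += 1
    return counts


def stage_sequences_bed(bed_path, out_dir):
    """Validate the BED file, then copy it into the output folder.

    Assumption: the target file name in the output folder is 'sequences.bed'.
    """
    counts = count_split_labels(bed_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / "sequences.bed"
    shutil.copyfile(bed_path, dest)
    return dest, counts
```

After staging the file with something like `stage_sequences_bed("sequences.bed", "my_output_dir")`, run the data script on that output folder with the --restart option as described above. The returned counts are a quick sanity check that the predefined splits were read as expected.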

davek44 commented 1 year ago

Yes, I do this sometimes via the method you described.