calico / borzoi

RNA-seq prediction with deep convolutional neural networks.
Apache License 2.0

Genome folds #1

Closed casblaauw closed 1 year ago

casblaauw commented 1 year ago

Dear developers,

Thanks for releasing your model. I'm sure I speak for many in the community in saying that it looks hugely impressive! To use and validate it, I'd like to know which regions of the genome are in each of the test/validation folds that were used to train the four models. For Enformer/Basenji, that was easily reconstructed from the helpfully shared sequences_[human|mouse].bed files in the public Google Storage bucket with 'supplementary' small files here, but I don't believe that's available for Borzoi yet?

Of course, it could be reconstructed from the large training dataset files, but given that I'm only looking for the genomic coordinates rather than the fully processed tracks corresponding to those, I was hoping there is an easier way.

Related to that, though: none of the files in the borzoi-paper bucket currently seem to be available, as requests return the following error:

<Error>
 <Code>UserProjectMissing</Code>
 <Message>
  Bucket is a requester pays bucket but no user project provided.
 </Message>
 <Details>
  Bucket is a requester pays bucket but no user project provided.
 </Details>
</Error>

Although I'm hoping to not need those files at the moment, I figured I'd still mention it to let you know.

I'm sure the public release has left everyone swamped with questions coming in and issues popping up, so I appreciate any bit of time you are willing to spend on this!

davek44 commented 1 year ago

Thanks for your interest! I added the sequences and targets files to a data/ directory in the GitHub repository, too, so you don't have to figure out GCP for that.

For model f0, sequences labeled fold0 form the test set and sequences labeled fold1 form the validation set. For model f1, sequences labeled fold1 form the test set and sequences labeled fold2 form the validation set, and so on.
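
For anyone who wants to turn that rule into coordinates, here is a minimal Python sketch. It is not taken from the Borzoi codebase: `folds_for_model` and `regions_by_fold` are hypothetical helper names, and the code assumes the sequences file is a tab-separated BED whose fourth column holds the fold label (e.g. `fold0`). The thread does not state the total number of folds, so `num_folds` is left as a parameter, and the wrap-around for the last model is an assumption rather than something confirmed above. The toy data is made up purely for illustration.

```python
import csv
import io
from collections import defaultdict


def folds_for_model(model_index, num_folds):
    """Return (test_label, validation_label) for model f<model_index>.

    Follows the rule described in the thread: model fN tests on foldN and
    validates on fold(N+1). Wrapping around at num_folds is an assumption.
    """
    test = model_index % num_folds
    valid = (model_index + 1) % num_folds
    return f"fold{test}", f"fold{valid}"


def regions_by_fold(bed_handle):
    """Group (chrom, start, end) tuples by the label in BED column 4.

    Assumes a tab-separated file with at least four columns; shorter lines
    and lines starting with '#' are skipped.
    """
    regions = defaultdict(list)
    for row in csv.reader(bed_handle, delimiter="\t"):
        if len(row) < 4 or row[0].startswith("#"):
            continue
        chrom, start, end, label = row[:4]
        regions[label].append((chrom, int(start), int(end)))
    return regions


# Toy stand-in for a sequences BED file (made-up coordinates and fold count).
toy_bed = io.StringIO(
    "chr1\t0\t100\tfold0\n"
    "chr1\t100\t200\tfold1\n"
    "chr2\t0\t100\tfold2\n"
    "chr2\t100\t200\tfold1\n"
)

regions = regions_by_fold(toy_bed)
test_label, valid_label = folds_for_model(1, num_folds=3)
test_regions = regions[test_label]
valid_regions = regions[valid_label]
```

To use it on a real file, replace `toy_bed` with an open file handle for the sequences file in the data/ directory and pass the actual fold count as `num_folds`.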