ML newbie trying to run Baskerville

jesspeers commented 7 months ago

Hello,

I'm very new to machine learning and want to run Baskerville (as suggested by @davek44 on the Basenji page).

I was wondering if you had any beginner-friendly guidance on how to use Baskerville (e.g. how to pre-process the data and split it into train/test/validation sets, how to train the model, etc.)? I was relying quite heavily on the Basenji ipynb tutorials and I'm a little confused about how to use Baskerville.

I'm hoping to supply ATAC-seq training data to the model and use the output to investigate deleterious variants in regulatory elements of model & non-model species, so Baskerville seems like the ideal tool to use, but I'm unfortunately a bit of a beginner!

Many thanks, Jess

davek44 commented 7 months ago

Hi Jess, we haven't completely ported the data preprocessing code into this new repository. I can prioritize that for you. In the case of your non-model species, you'll need to start from scratch. But for the model species, assuming it's human or mouse, you can consider transfer learning from our pretrained model. We're working on scripts for that now.

Although I obviously like the tools we develop, they don't necessarily surpass simpler methods for peak data like ATAC-seq, where distal interactions aren't as important. You might also consider ChromBPNet from Anshul Kundaje's group, which is able to model the Tn5 cutting bias and nucleotide-precision cut sites from ATAC. https://github.com/kundajelab/chrombpnet

jesspeers commented 7 months ago

Hi Dave,

Thanks so much for your response! I'll have a look at ChromBPNet and come back to Baskerville once the preprocessing code is ported over.

Many thanks, Jess

jesspeers commented 6 months ago

Hi Dave,

Thanks again for all your help. I'd really like to apply Baskerville if possible, so do you have a rough estimate of when the preprocessing code might be ported over?

For my application, I think transfer learning from your pretrained model should work, so do you know roughly how long it might take for those scripts to become available?

Many thanks, Jess

davek44 commented 6 months ago

Hi Jess,

I just ported the data preprocessing code and merged it into the main branch. You'll basically need to make a targets table similar to the one we used for Borzoi here: https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_human.txt
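For readers who want to script the targets table, here is a minimal sketch in Python. The column names are assumptions modeled on the linked Borzoi file, and every identifier, path, and numeric setting in `rows` is a hypothetical placeholder; compare the output against the linked file's header before using it.

```python
import csv
import io

# ASSUMPTION: column names modeled on the linked Borzoi targets file.
# Check that file for the authoritative header before running any scripts.
COLUMNS = ["identifier", "file", "clip", "clip_soft", "scale",
           "sum_stat", "strand_pair", "description"]


def targets_table(rows):
    """Return a tab-separated targets table with a leading integer index column."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([""] + COLUMNS)  # the index column has an empty header
    for i, row in enumerate(rows):
        writer.writerow([i] + [row[c] for c in COLUMNS])
    return buf.getvalue()


# Hypothetical ATAC-seq tracks. strand_pair is set to each row's own index
# on the assumption that the tracks are unstranded.
rows = [
    {"identifier": "ATAC_rep1", "file": "/path/to/atac_rep1.w5", "clip": 128,
     "clip_soft": 32, "scale": 1.0, "sum_stat": "sum", "strand_pair": 0,
     "description": "ATAC:sample rep1"},
    {"identifier": "ATAC_rep2", "file": "/path/to/atac_rep2.w5", "clip": 128,
     "clip_soft": 32, "scale": 1.0, "sum_stat": "sum", "strand_pair": 1,
     "description": "ATAC:sample rep2"},
]

table = targets_table(rows)
print(table)
```

Writing `table` to a `.txt` file gives one row per coverage track, which is the general shape of the linked example.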

Then here's an example of how we ran the scripts for the recent Borzoi dataset: https://github.com/calico/borzoi/tree/main/src/scripts/data/training_data Just substitute "hound" for "basenji" in the script names.
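The name substitution can be sketched as a small Python helper. This is only an illustration of the renaming rule described above; the function name `to_hound` and the example paths are hypothetical, and it assumes the scripts follow the `basenji_<name>.py` pattern.

```python
import re


def to_hound(command: str) -> str:
    """Swap the 'basenji_' prefix for 'hound_' in script names such as basenji_data.py.

    Only script names ending in .py are rewritten, so a directory that happens
    to be called 'basenji' is left untouched.
    """
    return re.sub(r"\bbasenji_(\w+\.py)\b", r"hound_\1", command)


# Hypothetical command fragments copied from a Makefile.
for line in ["basenji_data.py", "/opt/basenji/bin/basenji_data_align.py"]:
    print(to_hound(line))
```

The same rule can of course be applied by hand when editing a copy of the Makefile.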

Reach out if you have additional questions. We'll aim to bring in the new transfer learning script next.

Best, David

GMFranceschini commented 6 months ago

Hi @davek44, I hope it's ok to follow up on this thread as I am also new to ML on sequences.

I aim to obtain a representative feature vector of each genomic bin (say 50kb), possibly incorporating other epigenetic data like accessibility and histone mark tracks. This will ultimately be used for a classification task that would benefit from this well-built sequence representation, or at least that is my intuition.

I am working with hg19; would it be straightforward to start from a pre-trained model and get "embeddings" for those genomic bins? I am asking if this makes sense and if I am looking at the correct repo. Thank you,

Gian

davek44 commented 6 months ago

Hi Gian, this is a different enough question that I'd recommend you open a separate issue. But yes, moving to hg19 should be fine.

DavidvanBruggen commented 3 months ago

Hi Dave,

Thanks for making this great work available!

I have a question about dropping the alignment between human and mouse in the Makefile approach you linked above. I want to train a Borzoi model on mouse only, dropping the alignment steps. Can you tell me how to run hound_data.py properly? At the moment it is not obvious to me.

Thanks!

davek44 commented 3 months ago

Hi, for a single genome, skip the hound_data_align.py command and run hound_data.py without the --restart option, adding the -l $(LENGTH), --stride $(TSTRIDE), and --umap_t 0.5 options (which were previously handled at the align stage for multiple genomes).
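As a rough sketch, the single-genome invocation could be assembled like this in Python. Only the `-l`, `--stride`, and `--umap_t` flags and the absence of `--restart` come from the comment above. The positional FASTA and targets arguments, the file names, and the numeric values standing in for `$(LENGTH)` and `$(TSTRIDE)` are assumptions for illustration, and any other options from the Makefile's hound_data.py line would still need to be carried over.

```python
import shlex


def single_genome_command(fasta, targets, length, stride, umap_t=0.5):
    """Build a hound_data.py argument list for one genome (no align step, no --restart)."""
    return [
        "hound_data.py",
        "-l", str(length),          # stands in for $(LENGTH) from the Makefile
        "--stride", str(stride),    # stands in for $(TSTRIDE) from the Makefile
        "--umap_t", str(umap_t),
        fasta,                      # ASSUMPTION: genome FASTA as a positional argument
        targets,                    # ASSUMPTION: targets table as a positional argument
    ]


# Hypothetical placeholder values; take the real ones from your own Makefile.
cmd = single_genome_command("mm10.fa", "targets_mouse.txt", length=131072, stride=65536)
print(shlex.join(cmd))
```

Printing the command rather than executing it makes it easy to compare against the Makefile before running anything.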

calico / baskerville

ML newbie trying to run Baskerville #27