Better handling of genomics files.

aertslab / CREsted


Better handling of genomics files. #52

Open UCDNJJ opened 3 days ago

UCDNJJ commented 3 days ago

Description of feature

Hi,

A few thoughts after working with CREsted:

1. Would you consider a genome-style class, like the one in snapatac2, for handling the reference files (FASTA, chrom sizes, GTF, region BED file)?
2. A related question: what is the expected format of /home/VIB.LOCAL/niklas.kempynck/nkemp/mouse/biccn/mm.chrom.sizes in the peak regression tutorial?
3. Is there a way to train the model without holding out chromosomes for val/test, or to let the model see these chromosomes after training?

Thanks and really nice work in this package!

nkempynck commented 3 days ago

Hi Nelson

Nice that you are using CREsted! To address your points:

1. We have not considered that, but we will look into it if it would make things easier for users. For now you only need the pseudobulk bigWig files per class (for regression models), which are a standard output of most of these packages.
2. The chromosome sizes file is tab-separated with one chromosome per line: chr_name \t chromsize(int) \n.
3. For now it is mandatory to assign a valid split, partly to make sure less experienced users don't forget to do it. If you want to use all your data, you could continue training your model as shown in the finetuning example: restructure your train-val-test split and continue with an appropriate learning rate.
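To make the chrom sizes format concrete, here is a minimal sketch in plain Python that reads such a file into a dictionary. The function name `parse_chrom_sizes` is hypothetical (it is not part of CREsted), and the chromosome lengths are toy values for illustration only, not real genome sizes.

```python
import io


def parse_chrom_sizes(handle):
    """Read 'chr_name<TAB>size' lines into a {name: size} dict.

    Hypothetical helper for illustration; not a CREsted function.
    """
    sizes = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected 'name<TAB>size', got {line!r}")
        # int() raises ValueError if the size column is not an integer
        sizes[fields[0]] = int(fields[1])
    return sizes


# Toy content in the expected layout (made-up sizes, not a real assembly)
example = io.StringIO("chr1\t1000\nchr2\t800\nchrX\t600\n")
sizes = parse_chrom_sizes(example)
print(sizes)
```

With a real file you would pass `open(path)` instead of the `io.StringIO` object.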

Thanks for bringing up these concerns! Niklas

LukasMahieu commented 2 days ago

To jump in on point 3: as a workaround to train on all regions, you could split the data using strategy 'region' and select a very low fraction for both val and test, so that only one region ends up in each. That way you'd be training on (almost) all the data and running validation on a single region during training.
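
The idea behind this workaround can be sketched in plain Python. This is only an illustration of a region-level split that holds out a single region for val and a single region for test. It does not call CREsted, and `split_regions`, its parameters, and the region names are all hypothetical.

```python
import random


def split_regions(regions, n_val=1, n_test=1, seed=0):
    """Assign regions to train/val/test, holding out only a few regions.

    Hypothetical helper for illustration; not the CREsted implementation.
    """
    if len(regions) < n_val + n_test + 1:
        raise ValueError("need at least one region left over for training")
    rng = random.Random(seed)  # local RNG so the split is reproducible
    shuffled = list(regions)   # copy so the caller's list is not mutated
    rng.shuffle(shuffled)
    return {
        "val": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }


# Ten made-up 500 bp regions on chr1
regions = [f"chr1:{start}-{start + 500}" for start in range(0, 5000, 500)]
split = split_regions(regions)
print({name: len(members) for name, members in split.items()})
```

With ten regions this leaves eight for training. In CREsted itself the same effect would come from the 'region' strategy with very small val and test fractions, as described above.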