Closed BartTheeuwes closed 11 months ago
Thanks for trying out the package!
This stems from an annoying nuance in using "spearman" as the metric in the SequenceModule (SpearmanCorrCoef). It won't automatically average the metric across tasks like R2Score does, so PyTorch Lightning complains when you give it a 2D Tensor of correlations to log.
After playing with it for a bit, I wasn't able to find a workaround for using "spearman" with the current release on PyPI. For now, I'd recommend just training with "r2score" as the passed-in metric.
model = models.SequenceModule(
    arch=model,
    task="regression",
    loss_fxn="mse",
    scheduler="reduce_lr_on_plateau",
    optimizer="adam",
    metric="r2score",
    optimizer_lr=0.002,
)
The metric doesn't affect the fitting of the model, but you won't get to see how Spearman changes across training. You can always calculate it on the test set post-hoc and I'm going to clean up how metrics are handled in a future release so this isn't an issue.
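As a rough illustration of the post-hoc route, here is a minimal, dependency-free sketch that computes Spearman separately for each task and then averages the result into one scalar (the reduction that is missing during training). All function names and the toy values are hypothetical and are not part of EUGENe or SeqDatasets; in practice you would pass in your own test-set predictions and labels, and scipy.stats.spearmanr applied per column should give the same per-task numbers.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_per_task(preds, targets):
    """preds and targets are lists of rows: one row per sequence, one column per task."""
    n_tasks = len(preds[0])
    return [
        spearman([row[t] for row in preds], [row[t] for row in targets])
        for t in range(n_tasks)
    ]


# Toy values standing in for test-set predictions and labels (two tasks).
preds = [[0.1, 2.0], [0.4, 1.5], [0.35, 1.0], [0.8, 0.5]]
targets = [[0.0, 0.4], [0.5, 0.2], [0.3, 0.3], [0.9, 0.1]]

per_task = spearman_per_task(preds, targets)
mean_spearman = sum(per_task) / len(per_task)  # single scalar summary across tasks
print(per_task, mean_spearman)
```

Reporting the per-task values alongside the mean is usually more informative than the mean alone, since one task can be much harder to predict than another.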
I was able to reproduce the issue with the DeepSTARR zarr files. There's a bug that I will patch, and you can install SeqDatasets from source if you want to go that route (this line shouldn't exist: https://github.com/ML4GLand/SeqDatasets/blob/main/seqdatasets/_datasets.py#L318C13-L318C49). However, your workaround should work just fine.
Sorry about the difficulty with concatenating objects. I love XArray, but it's not the most intuitive to use. This is how I would concatenate two SeqDatas:
import xarray as xr
sdata_train["train_val"] = True
sdata_val["train_val"] = False
sdata_training = xr.concat([sdata_train, sdata_val], dim="_sequence")
Thank you Adam,
After looking around the ML4GLand organization, I came across the use_cases repository. These seem like really useful tutorials. I'm now running the DeepSTARR use case with eugene, using the following notebooks: https://github.com/ML4GLand/use_cases/tree/main/DeepSTARR/eugene
There are some minor errors in the code that I'm fixing one by one. The first 2 notebooks now run fine, and I'm testing the attribution analysis now. I noticed that the original Eugene documentation does not link to the use_cases repository. I think it would be very helpful for new users to have the tutorials more easily accessible.
Once again, thank you for creating this package.
Great, glad you found those useful. Some were created with an earlier version of the package and do need to be updated. If you end up with a set of working notebooks, feel free to submit a pull request! :)
That's a great suggestion; I will add the link to the landing page of the documentation.
Going to close this for now, but feel free to reopen if you run into any other sticking points!
First off, thank you for creating this package, I'm hoping it will speed up some of the analysis that I am planning to do. As a trial run, I was hoping to re-analyse the deAlmeida22 data in the same way as in the original paper. However, I'm running into multiple issues along the way.
Here is the code that I tried to run to analyse the data, along with the output describing the error. (Downloading the data using seqdatasets.deAlmeida22() failed to load the sequence data correctly, hence the strange workaround I implemented.)
This resulted in the following error: