genomic-medicine-sweden / tomte

A Nextflow pipeline for analysing expression and splicing in RNA-seq data from rare disease patients
MIT License

Allow running DROP without reference samples #152

Closed Jakob37 closed 1 month ago

Jakob37 commented 1 month ago

Description of feature

Right now it looks like the drop_sample_annot.py expects a reference.

In this case, I have a large run with 100 samples, which would serve as their own reference.

It does not seem to be supported currently. If not supplying an external reference, DROP will not run.

I am working around this by supplying an empty reference and making some changes to drop_sample_annot.py to not crash it if no data is present in the df. It would be helpful to have an "official" way to do this.
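The workaround described above could look roughly like the following. This is only a sketch under assumptions: `read_reference_rows` and `combine_annotations` are hypothetical helpers, not functions from the actual drop_sample_annot.py, and the `RNA_ID` / `DROP_GROUP` column names are used purely for illustration. The idea is that an empty or header-only reference file yields no rows, and the merge step is skipped instead of crashing.

```python
import csv
import io


def read_reference_rows(handle):
    """Return reference rows as dicts; an empty or header-only file yields []."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Keep only rows that contain at least one non-blank string value.
    return [
        row
        for row in reader
        if any(isinstance(v, str) and v.strip() for v in row.values())
    ]


def combine_annotations(sample_rows, reference_rows):
    """Append reference rows to the sample rows, tolerating a missing reference."""
    if not reference_rows:
        # No external reference: the samples act as their own reference.
        return list(sample_rows)
    return list(sample_rows) + list(reference_rows)


if __name__ == "__main__":
    # Hypothetical input: two samples and an empty reference file.
    samples = [{"RNA_ID": "s1"}, {"RNA_ID": "s2"}]
    reference = read_reference_rows(io.StringIO(""))
    print(combine_annotations(samples, reference))
```

With a large cohort (such as the 100 samples mentioned here), the combined annotation would then simply contain the samples themselves.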

What do you think?

Sorry about the issue bombardment today 🫣

Lucpen commented 1 month ago

No worries, we are happy to get issues and improve the pipeline 😄
At the moment Tomte is designed to run only a few samples against an already existing database, not to create one. However, we have been working on modifying the code so that an actual database can be created. Here is the PR. We still need to test it thoroughly, but if you want to give it a try, feel free to run it and to suggest improvements.

Jakob37 commented 1 month ago

OK, great! I'll give it a go. That sounds exactly like what we will need ahead.

Managed to get the DROP run started anyway without any reference db. We will see how that goes ...

Lucpen commented 1 month ago

Please let me know if it works. If it doesn't, it will be better to restart DROP outside of the pipeline, as explained here

Jakob37 commented 1 month ago

Thanks for the tips! That will be very helpful.

It made it pretty far (edit: not super far, a little bit), into the Counting_Summary step. Will see if I can figure that out today 🤔

Jakob37 commented 1 month ago

The aberrant expression run went through 🎉 I needed to remove the following columns from the produced sample_annot.tsv file: GENE_COUNTS_FILE and SEX. Otherwise both were produced filled with NA values, which DROP could not handle downstream. It seems its R parsing 'cleverly' translates the string "NA" to nan.

I raised an issue in DROP about it: https://github.com/gagneurlab/drop/issues/568
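The column removal described above can be sketched as follows. This is a minimal illustration, not part of Tomte or DROP: `drop_all_na_columns` is a hypothetical helper, and the assumption that both empty strings and the literal "NA" count as missing is mine. It only drops a candidate column when every value in it is missing, so a partially filled SEX column would be kept.

```python
import csv
import io

# Assumption: these tokens are treated as "missing" in the annotation file.
NA_TOKENS = {"", "NA"}


def drop_all_na_columns(src, dst, candidates=("GENE_COUNTS_FILE", "SEX")):
    """Copy a TSV from src to dst, omitting candidate columns that are all NA.

    Returns the sorted list of column names that were dropped.
    """
    reader = csv.DictReader(src, delimiter="\t")
    rows = list(reader)
    fieldnames = reader.fieldnames or []
    to_drop = {
        name
        for name in candidates
        if name in fieldnames
        and all((row[name] or "").strip() in NA_TOKENS for row in rows)
    }
    kept = [name for name in fieldnames if name not in to_drop]
    writer = csv.DictWriter(
        dst,
        fieldnames=kept,
        delimiter="\t",
        extrasaction="ignore",  # silently skip the dropped columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return sorted(to_drop)


if __name__ == "__main__":
    # Hypothetical annotation content for demonstration.
    src = io.StringIO("RNA_ID\tGENE_COUNTS_FILE\tSEX\ns1\tNA\tNA\ns2\tNA\tNA\n")
    dst = io.StringIO()
    print(drop_all_na_columns(src, dst))
    print(dst.getvalue())
```

In practice `src` and `dst` would be file handles opened on the sample_annot.tsv produced by the pipeline and on a cleaned copy passed to DROP.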

The splicing run is still going. It filled our RAM when running with 64 threads, but seems to be doing fine with fewer threads (12). It might require some further fiddling with the sample_annot.tsv file to get it through the downstream steps, I guess. We will see.

Jakob37 commented 1 month ago

Have you guys btw considered running OUTRIDER and FRASER2 outside the DROP pipeline? It seems the handful of steps could be lifted over from Snakemake to one or two Nextflow subworkflows. This would make things much cleaner: easier debugging, resuming and caching, and fewer dependencies on DROP.

I realize this would mean considerable extra work to set up, and it might not be feasible. Just a thought!

Jakob37 commented 1 month ago

The FRASER2 pipeline ran pretty far but crashed in one of the final FRASER calculation steps. It seems to be a bug that only appears when not using external counts: https://github.com/gagneurlab/drop/issues/558