KevinMenden / scaden

Deep Learning based cell composition analysis with Scaden.
https://scaden.readthedocs.io
MIT License
71 stars 25 forks

Pattern matching unintuitive #95

Open khkk378 opened 3 years ago

khkk378 commented 3 years ago

I think the pattern matching is a bit unintuitive. Say I have raw and processed files in a directory and I want to simulate from either of them. Intuitively I would set pattern to e.g. *_raw_counts.txt, but that will assume the matching cell type file will be called foo_celltypes.txt rather than foo_raw_celltypes.txt. It seems you remove the pattern from the filename and then append _celltypes.txt. I think a better option would be to replace _counts.txt with _celltypes.txt for files matching the pattern.
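To make the difference concrete, here is a minimal sketch of the two naming rules discussed above. It is a reconstruction from the description in this issue, not code taken from Scaden's source, and both function names are hypothetical.

```python
COUNTS_SUFFIX = "_counts.txt"
CELLTYPES_SUFFIX = "_celltypes.txt"


def celltypes_by_pattern_removal(count_file: str, pattern: str) -> str:
    """Behavior as described in this issue (an assumption, not Scaden's code):
    strip the non-wildcard part of the pattern, then append _celltypes.txt."""
    literal_part = pattern.replace("*", "")
    return count_file.replace(literal_part, "") + CELLTYPES_SUFFIX


def celltypes_by_suffix_swap(count_file: str) -> str:
    """Proposed behavior: replace the trailing _counts.txt with _celltypes.txt."""
    if not count_file.endswith(COUNTS_SUFFIX):
        raise ValueError(f"{count_file!r} does not end with {COUNTS_SUFFIX!r}")
    return count_file[: -len(COUNTS_SUFFIX)] + CELLTYPES_SUFFIX


# The pattern *_raw_counts.txt applied to foo_raw_counts.txt:
print(celltypes_by_pattern_removal("foo_raw_counts.txt", "*_raw_counts.txt"))
print(celltypes_by_suffix_swap("foo_raw_counts.txt"))
```

The first rule loses the `_raw` part of the name, while the second keeps everything before `_counts.txt`.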

Cheers, Rasmus

KevinMenden commented 3 years ago

Hmm, yes, true, it can be a bit annoying. Maybe the solution would be to generally improve this pattern matching. I think I wanted to give the option to manually list files at some point anyway, although that won't work for lots of files; there you still need patterns.

So maybe the best option is to be able to specify a --counts-pattern or a --celltype-pattern or both. If both are supplied, it tries to find matching pairs; if only one is supplied, well, it also tries to find matching pairs. Shouldn't be too hard to add.
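A rough sketch of how such pairing could work is below. Everything here is an assumption for illustration: the function names are hypothetical, each pattern is assumed to contain exactly one `*`, files are paired on the text that `*` matches, and a missing pattern is derived from the other by swapping `_counts.txt` and `_celltypes.txt`. Neither option exists in Scaden at this point in the discussion.

```python
import re


def _wildcard_to_regex(pattern: str) -> re.Pattern:
    """Turn a pattern with exactly one '*' into a regex capturing the wildcard."""
    head, sep, tail = pattern.partition("*")
    if not sep or "*" in tail:
        raise ValueError(f"pattern {pattern!r} must contain exactly one '*'")
    return re.compile(re.escape(head) + "(.+)" + re.escape(tail))


def find_pairs(filenames, counts_pattern=None, celltype_pattern=None):
    """Return {key: (counts_file, celltypes_file)} for files sharing a wildcard match."""
    if counts_pattern is None and celltype_pattern is None:
        raise ValueError("supply at least one pattern")
    # Assumption: derive the missing pattern by swapping the conventional suffix.
    if celltype_pattern is None:
        celltype_pattern = counts_pattern.replace("_counts.txt", "_celltypes.txt")
    if counts_pattern is None:
        counts_pattern = celltype_pattern.replace("_celltypes.txt", "_counts.txt")

    counts_re = _wildcard_to_regex(counts_pattern)
    celltype_re = _wildcard_to_regex(celltype_pattern)

    counts, celltypes = {}, {}
    for name in filenames:
        match = counts_re.fullmatch(name)
        if match:
            counts[match.group(1)] = name
        match = celltype_re.fullmatch(name)
        if match:
            celltypes[match.group(1)] = name

    # Keep only keys that have both a counts file and a cell type file.
    shared = sorted(counts.keys() & celltypes.keys())
    return {key: (counts[key], celltypes[key]) for key in shared}
```

With only `counts_pattern="*_raw_counts.txt"` supplied, this sketch would look for `*_raw_celltypes.txt` and silently skip count files that have no partner; whether skipping or raising an error is preferable is a design choice left open here.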

khkk378 commented 3 years ago

Maybe easiest to just enforce that count files should end with _counts.txt and cell types with _celltypes.txt? I don't see the use case for a lot of flexibility there. I think it's more important to be able to make a flexible selection among a collection of datasets. Then use regexps to match the rest: --pattern foo/bar_(raw|processed) to select both raw and processed samples from foo/bar for example.
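As an illustration of this suggestion, the sketch below fixes the two suffixes and treats `--pattern` as a regular expression that must match everything before the suffix. The function name and the choice of a full match on the stem are assumptions made for the example, not existing Scaden behavior.

```python
import re

COUNTS_SUFFIX = "_counts.txt"
CELLTYPES_SUFFIX = "_celltypes.txt"


def select_datasets(paths, pattern):
    """Return (counts, celltypes) pairs whose shared stem fully matches the regex."""
    regex = re.compile(pattern)
    available = set(paths)
    pairs = []
    for path in sorted(available):
        if not path.endswith(COUNTS_SUFFIX):
            continue
        stem = path[: -len(COUNTS_SUFFIX)]
        celltypes = stem + CELLTYPES_SUFFIX
        # Keep the dataset only if the stem matches and its partner file exists.
        if regex.fullmatch(stem) and celltypes in available:
            pairs.append((path, celltypes))
    return pairs


paths = [
    "foo/bar_raw_counts.txt",
    "foo/bar_raw_celltypes.txt",
    "foo/bar_processed_counts.txt",
    "foo/bar_processed_celltypes.txt",
    "foo/baz_raw_counts.txt",
    "foo/baz_raw_celltypes.txt",
]
for counts, celltypes in select_datasets(paths, r"foo/bar_(raw|processed)"):
    print(counts, celltypes)
```

Because the suffixes are fixed, the regex only has to describe which datasets to select, and the cell type file name always follows from the counts file name.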