Additional Amplicon sequencing features

ThijsMaas commented 9 months ago

Hi!

I put together a few more commits related to the amplicon sequencing mode. The new feature is an input file that uses simple read counts and disables the abundance calculation altogether. We didn't need the abundance or coverage calculation, our preprocessing tool gives us the exact read count we want to simulate for every amplicon/genome.

Additionally, I have a few more things I would like to ask your input on:

The multithreading during read generation currently only works by splitting up the load for every single reference genome/amplicon. In our amplicon simulation we often have many amplicons with few reads each, so the load splitting is not efficient. A method of batching multiple amplicons to each worker would be better, but this would require a lot of rewriting of the main generate_reads() function, which is already very long. Would you mind if I send a PR with a rewrite of that function?
There looks to be some consistent code styling in here, but I do not see any mention of them, other than pycodestyle in the requirements. I have been very happy with the tools black, isort, and flake8, and setting them up using a pyproject.toml. We could also add the styling checks in the github action runner. Do you have a preference on this?

HadrienG commented 9 months ago

Hi!

The multithreading during read generation currently only works by splitting up the load for every single reference genome/amplicon. In our amplicon simulation we often have many amplicons with few reads each, so the load splitting is not efficient. A method of batching multiple amplicons to each worker would be better, but this would require a lot of rewriting of the main generate_reads() function, which is already very long. Would you mind if I send a PR with a rewrite of that function?

That would be good, and it could also be beneficial to write the reads to file instead of keeping them in memory, like #210 suggests (but optional, some people have slow disks and a lot of memory, i.e. on some academic clusters)

There looks to be some consistent code styling in here, but I do not see any mention of them, other than pycodestyle in the requirements. I have been very happy with the tools black, isort, and flake8, and setting them up using a pyproject.toml. We could also add the styling checks in the github action runner. Do you have a preference on this?

This is been a solo project thus far, the consistent-ish style is me being consistent-ish 😅 I have nothing against an automated linter, but I've never used any. Happy to read up on them and forge myself an opinion.

HadrienG commented 8 months ago

code looks good, ready to merge

HadrienG / InSilicoSeq

Additional Amplicon sequencing features #242