Modularize msa and bam generation

NCBI-Hackathons / NovoGraph

NovoGraph: building whole genome graphs from long-read-based de novo assemblies

MIT License


Modularize msa and bam generation #34

Closed TorHou closed 4 years ago

TorHou commented 4 years ago

This separates the processes of generating MSAs and BAMs from the control logic of CALLMAFFT. I can't guarantee that this will not break things, so I will run some tests to mitigate the risk.

TorHou commented 4 years ago

The main test is ongoing. Will report back here. Continuous Integration would be very nice here ...

evanbiederstedt commented 4 years ago

> Continuous Integration would be very nice here

I would certainly help out with this. Do you happen to have a file which works (and which is under 100MB)?

TorHou commented 4 years ago

The test was successful :)

I will try to put together a small test. I think a good idea would be to do everything on a small pseudo-chromosome, because the issue I ran into was that the scripts want to cover a whole chromosome even if there are no covering reads.

evanbiederstedt commented 4 years ago

We would simply create a Travis CI config for the Perl dependencies, and put a "test script" in /tests.

Let me know how to help out :)

TorHou commented 4 years ago

If we limit the tests to those scripts that don't need htslib, the dependencies are minimal. What I have seen done with tests before was something along these lines: `script.py predefined_input.data > output.data`, followed by a simple check with `diff output.data predefined_output.data`. Do you have something more sophisticated in mind?
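
A minimal sketch of that run-and-diff idea, written in Python purely for illustration. The helper name `matches_golden` is hypothetical and not part of NovoGraph; it assumes the script under test writes its result to standard output, as in the redirect above.

```python
import subprocess
from pathlib import Path


def matches_golden(command, expected_path):
    """Run `command` and compare its stdout byte-for-byte with a stored file.

    `command` is an argument list such as
    ["python", "script.py", "predefined_input.data"] (hypothetical names).
    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout == Path(expected_path).read_bytes()
```

A test would then call `matches_golden([...], "predefined_output.data")` and fail when it returns `False`, which is the same check as the `diff` but usable from a test runner.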

evanbiederstedt commented 4 years ago

Nothing much more sophisticated, but I would use a unit testing framework in Perl.

That allows more flexibility. So, for instance, as opposed to doing `diff output.data predefined_output.data`, you could test properties of that output.data.

e.g. values in the file, file size, etc.
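
As a rough sketch of such property checks, here is a Python version for illustration only (the thread itself suggests a Perl framework). The function names are hypothetical, and the tab-separated, three-column layout is an assumption made for the example, not the actual format of any NovoGraph output.

```python
import os


def output_properties(path):
    """Summarize a text file: size in bytes, line count, fields per line."""
    size = os.path.getsize(path)
    with open(path) as handle:
        lines = [line.rstrip("\n") for line in handle]
    # Set of distinct tab-separated field counts seen across all lines.
    field_counts = {len(line.split("\t")) for line in lines}
    return {"size": size, "lines": len(lines), "field_counts": field_counts}


def looks_valid(path, min_size=1, expected_fields=3):
    """True if the file is at least `min_size` bytes and every line has
    exactly `expected_fields` tab-separated fields (assumed layout)."""
    props = output_properties(path)
    return props["size"] >= min_size and props["field_counts"] == {expected_fields}
```

Checks like these tolerate harmless differences, such as reordered lines, that a byte-for-byte `diff` would report as failures.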