Psy-Fer / buttery-eel

The buttery eel - a slow5 guppy/dorado basecaller wrapper
MIT License

Model convention and availability of the aligner in buttery-eel #52

Open hasindu2008 opened 1 month ago

hasindu2008 commented 1 month ago

@SBurnard Moving this https://github.com/hasindu2008/nci-scripts/issues/1 conversation to here as these questions are about buttery-eel.

It is a great suggestion about the model conversions. @Psy-Fer, we should maintain a server-to-standalone model mapping page in https://github.com/Psy-Fer/buttery-eel/blob/main/docs/. The tricky thing is that these models keep changing from version to version, so perhaps we can document the use of the following command.

cd /path/to/ont-dorado-server/data/
grep "model" *.cfg | tr ':' '\t' | tr '=' '\t' | awk '{print $1"\t"$2"\t"$3}' | sort -k1,1

@SBurnard Buttery-eel relies on the dorado-server from ONT (which does the live basecalling in MinKNOW) to implement the basecalling. So this model configuration convention comes from ONT's dorado-server, and for some reason standalone Dorado uses a different convention. On Gadi, I find the available models as follows:

cd /g/data/if89/apps/buttery-eel/0.5.1+dorado7.4.12/ont-dorado-server/data/
grep "model" *.cfg | tr ':' '\t' | tr '=' '\t' | awk '{print $1"\t"$2"\t"$3}' | sort -k1,1

About the second question, slow5-dorado is a fork of standalone Dorado, so all the extra features in Dorado, such as alignment, are there. But we have not made a release recently because:

  1. Dorado has a zillion dependencies and it takes a few days to get everything compiled
  2. The codebase gets turned upside down between releases, which makes it hard to keep adding the slow5 support

The good thing with the dorado-server is we can simply get the binary from ONT and use the client-server approach (implemented in buttery-eel) to access BLOW5 files.

I am not sure if Dorado server supports alignment. @Psy-Fer Does it? However, even if it does support alignment, I personally believe that keeping basecalling and alignment modular has greater benefits:

  1. The user transparently knows which minimap2 version and parameters they are using, can tune parameters for their needs, and can even switch to a different aligner if they wish
  2. Having it separate means that users are more likely to cite the aligners they use, which would otherwise just be buried under "Dorado"
  3. I'd rather trust standalone minimap2 than a modified version coming from ONT. In fact, several issues that arose in f5c were eventually traced back to some oddity in Dorado alignment
  4. ONT has a track record of NOT honouring backward compatibility, so there is a chance that the API for getting the alignment information will keep changing (giving us yet another thing to rewrite every time)
  5. Having separate modules improves maintainability. The "one tool does all the things" approach leads to complex systems that have their own set of problems, and would create a dependency and maintenance nightmare.
  6. I can go on .....

Let me cite the following extract from Heng Li, Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences, Bioinformatics "We hope in this process, the community could standardize the input and output formats of various tools, so that a developer could focus on a component he or she understands best. Such a modular approach has been proved to be fruitful in the development of short-read tools—in fact, the best short-read pipelines all consist of components developed by different groups—and will be equally beneficial to the future development of long-read mappers and assemblers."

I understand that having a single command that runs everything could be convenient, but I am not sure it is really worth it, considering the above factors. What do you think?

Psy-Fer commented 1 month ago

Hey,

Okay, so let me go through these

It is a great suggestion about the model conversions. @Psy-Fer, we should maintain some server to standalone model mapping page in https://github.com/Psy-Fer/buttery-eel/blob/main/docs/. The tricky thing is these models keep changing from version to version, so perhaps we can document the use of the following command.

The "new" Dorado model naming is not present in dorado-server, as dorado-server is still based on the old guppy basecaller paradigms. The model naming and command-line conventions will only change in buttery-eel as they change in dorado-server, as I'm not going to try to translate between the two tools. Also keep in mind that dorado-server is what runs during MinKNOW live basecalling, and the models are still invoked this way internally by MinKNOW.

However, adding a table in the docs to show the model names and locations could be helpful, as well as providing some cmdline code or a script for dumping the model info for the user.

I am not sure if Dorado server supports alignment. @Psy-Fer Does it?

Dorado-server does support alignment; however, buttery-eel does not, and will not, support it. This decision was made for three reasons, some of which @hasindu2008 already mentioned:

  1. It is my firm belief that tools that run chronologically should be run that way rather than interleaved, unless there is some specific need for it. There is no time saving in doing alignment during basecalling versus after, and I would guess it's actually faster to do them separately.
  2. Maintaining alignment as a function of buttery-eel adds extra work on top of already dealing with constant breaking changes from ONT (and my own bugs). Alignment has nothing to do with S/BLOW5 files, so it is not part of the problem buttery-eel is trying to solve
  3. I don't trust alignment in dorado or dorado-server, and neither should you

Further to this, if you want to run everything in "one command", a few lines in a bash script will get you there.
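For example, something roughly along these lines (just a sketch; the binary path, config name, port, device, reference and thread count are placeholders, and the exact buttery-eel flags depend on the version you have installed):

#!/bin/bash
set -e
# basecall a BLOW5 file via buttery-eel (dorado-server does the work underneath) ...
buttery-eel -g /path/to/ont-dorado-server/bin --config dna_r10.4.1_e8.2_400bps_5khz_sup.cfg \
    --use_tcp --port 5558 -x "cuda:all" -i reads.blow5 -o reads.fastq
# ... then align the basecalls with standalone minimap2 and sort/index with samtools
minimap2 -ax map-ont ref.fa reads.fastq | samtools sort -@ 8 -o reads.bam
samtools index reads.bam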

I hope this helps to clarify things. I know I have some strong opinions here, but they are not without some pretty strong evidence over the years working with these ONT tools/APIs.

Let me know if you have more questions

Cheers, James

jelber2 commented 1 month ago

If I understand correctly, you might provide a table of the conversions between Dorado and Dorado-Server models. That would be very helpful, given the differences I have seen so far and those pointed out by both of you. For example, if I understand it correctly, dna_r10.4.1_e8.2_400bps_sup.cfg in ont-dorado-server 7.4.12 is actually the Dorado dna_r10.4.1_e8.2_400bps_sup@v4.3.0 model?

Psy-Fer commented 1 month ago

When you basecall, the actual model and model version are printed into the first line of each fastq record or in the header lines of the uSAM.
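For example, a quick way to check that (a rough sketch, assuming samtools is available; the exact header field names vary between guppy/dorado-server versions):

# first FASTQ record: its header line carries the basecall model information
head -n 1 reads.fastq
# for uSAM output, the model is recorded in the header lines
samtools view -H reads.sam | head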

Otherwise it's a bit of a pain to figure out for sure, depending on which version of the basecaller you are using: Guppy 6.5.7 vs dorado-server 7.4.12 or whatever. The cfg files point to a JSON model file, and somewhere between all of that is the actual information, though it changed between guppy and Dorado, and changed again in some of the latest updates. Because buttery-eel is backwards compatible with guppy back to around version 4.*, and is not released for each ONT release, this further complicates matters with this kind of thing (though it greatly simplifies things from a development and user point of view).

But yes, we will try to make a conversion table as a guide, though from what I've seen so far, Dorado and dorado-server don't really map 1:1 with each other. ONT devs have confirmed that this is the case too.

James

hasindu2008 commented 1 month ago

OK, I wrote some ugly one-liners to deduce these:

# change this to the server version you want
cd /path/to/ont-dorado-server/data/

# nucleotide models
echo "## nucleotide models"; echo; echo -e "|Dorado standalone model|Dorado server model|\n|---|---|" ; grep "dorado_model_path" *.cfg | tr ':' '\t' | tr '=' '\t' | awk '{print "|"$3"|"$1"|"}' | grep -v "modbases\|duplex"; echo; 

# modification models
echo "## modification models"; echo; echo -e "|Dorado standalone base model|Dorado standalone modmodel |Dorado server model|\n|---|---|---|"; for file in *modbases*.cfg ; do dorado_model=$(grep "dorado_model_path" $file | awk '{print $3}'); dorado_modbase=$(grep "dorado_modbase_models" $file | awk '{print $3}'); echo -e "|$dorado_model|$dorado_modbase|$file|"; done;

Dorado server 7.4.12

nucleotide models

| Dorado standalone model | Dorado server model |
|---|---|
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac.cfg |
| dna_r10.4.1_e8.2_260bps_sup@v4.1.0 | dna_r10.4.1_e8.2_260bps_sup.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.3.0 | dna_r10.4.1_e8.2_400bps_5khz_fast.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.3.0 | dna_r10.4.1_e8.2_400bps_5khz_hac.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.3.0 | dna_r10.4.1_e8.2_400bps_5khz_sup.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.1.0 | dna_r10.4.1_e8.2_400bps_sup.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_450bps_fast.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_450bps_hac.cfg |
| dna_r9.4.1_e8_sup@v3.3 | dna_r9.4.1_450bps_sup.cfg |
| rna002_70bps_fast@v3 | rna_r9.4.1_70bps_fast.cfg |
| rna002_70bps_hac@v3 | rna_r9.4.1_70bps_hac.cfg |
| rna004_130bps_fast@v3.0.1 | rna_rp4_130bps_fast.cfg |
| rna004_130bps_hac@v3.0.1 | rna_rp4_130bps_hac.cfg |
| rna004_130bps_sup@v3.0.1 | rna_rp4_130bps_sup.cfg |

modification models

| Dorado standalone base model | Dorado standalone modmodel | Dorado server model |
|---|---|---|
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_fast.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_hac.cfg |
| dna_r10.4.1_e8.2_260bps_sup@v4.1.0 | dna_r10.4.1_e8.2_260bps_sup@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_sup.cfg |
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_fast.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_hac.cfg |
| dna_r10.4.1_e8.2_260bps_sup@v4.1.0 | dna_r10.4.1_e8.2_260bps_sup@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_sup.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.3.0 | dna_r10.4.1_e8.2_400bps_hac@v4.3.0_5mCG_5hmCG@v1 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_hac.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.3.0 | dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mCG_5hmCG@v1 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_sup.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.3.0 | dna_r10.4.1_e8.2_400bps_hac@v4.3.0_5mC_5hmC@v1 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_hac.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.3.0 | dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mC_5hmC@v1 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_sup.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.3.0 | dna_r10.4.1_e8.2_400bps_hac@v4.3.0_6mA@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_6ma_hac.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.3.0 | dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_6ma_sup.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_fast.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_hac.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.1.0 | dna_r10.4.1_e8.2_400bps_sup@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_sup.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_fast.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.1.0 | dna_r10.4.1_e8.2_400bps_sup@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_sup.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_e8_fast@v3.4_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_fast.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_e8_hac@v3.3_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_hac.cfg |
| dna_r9.4.1_e8_sup@v3.3 | dna_r9.4.1_e8_sup@v3.3_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_sup.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_e8_fast@v3.4_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_fast.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_e8_hac@v3.3_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_hac.cfg |
| dna_r9.4.1_e8_sup@v3.3 | dna_r9.4.1_e8_sup@v3.3_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_sup.cfg |
| rna004_130bps_sup@v3.0.1 | rna004_130bps_sup@v3.0.1_m6A_DRACH@v1 | rna_rp4_130bps_modbases_m6a_drach_sup.cfg |

Dorado server 7.2.13

nucleotide models

| Dorado standalone model | Dorado server model |
|---|---|
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast.cfg |
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast_mk1c.cfg |
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast_prom.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac_mk1c.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac_prom.cfg |
| dna_r10.4.1_e8.2_260bps_sup@v4.1.0 | dna_r10.4.1_e8.2_260bps_sup.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.2.0 | dna_r10.4.1_e8.2_400bps_5khz_fast.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.2.0 | dna_r10.4.1_e8.2_400bps_5khz_fast_mk1c.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.2.0 | dna_r10.4.1_e8.2_400bps_5khz_fast_prom.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.2.0 | dna_r10.4.1_e8.2_400bps_5khz_hac.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.2.0 | dna_r10.4.1_e8.2_400bps_5khz_hac_mk1c.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.2.0 | dna_r10.4.1_e8.2_400bps_5khz_hac_prom.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.2.0 | dna_r10.4.1_e8.2_400bps_5khz_sup.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast_mk1c.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast_prom.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac_mk1c.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac_prom.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.1.0 | dna_r10.4.1_e8.2_400bps_sup.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_450bps_fast.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_450bps_fast_mk1c.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_450bps_fast_prom.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_450bps_hac.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_450bps_hac_mk1c.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_450bps_hac_prom.cfg |
| dna_r9.4.1_e8_sup@v3.3 | dna_r9.4.1_450bps_sup.cfg |
| dna_r9.4.1_e8_sup@v3.3 | dna_r9.4.1_450bps_sup_prom.cfg |
| rna002_70bps_fast@v3 | rna_r9.4.1_70bps_fast.cfg |
| rna002_70bps_fast@v3 | rna_r9.4.1_70bps_fast_mk1c.cfg |
| rna002_70bps_fast@v3 | rna_r9.4.1_70bps_fast_prom.cfg |
| rna002_70bps_hac@v3 | rna_r9.4.1_70bps_hac.cfg |
| rna002_70bps_hac@v3 | rna_r9.4.1_70bps_hac_mk1c.cfg |
| rna002_70bps_hac@v3 | rna_r9.4.1_70bps_hac_prom.cfg |
| rna004_130bps_fast@v3.0.1 | rna_rp4_130bps_fast.cfg |
| rna004_130bps_fast@v3.0.1 | rna_rp4_130bps_fast_mk1c.cfg |
| rna004_130bps_fast@v3.0.1 | rna_rp4_130bps_fast_prom.cfg |
| rna004_130bps_hac@v3.0.1 | rna_rp4_130bps_hac.cfg |
| rna004_130bps_hac@v3.0.1 | rna_rp4_130bps_hac_mk1c.cfg |
| rna004_130bps_hac@v3.0.1 | rna_rp4_130bps_hac_prom.cfg |
| rna004_130bps_sup@v3.0.1 | rna_rp4_130bps_sup.cfg |

modification models

| Dorado standalone base model | Dorado standalone modmodel | Dorado server model |
|---|---|---|
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_fast.cfg |
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_fast_mk1c.cfg |
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_fast_prom.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_hac.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_hac_mk1c.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_hac_prom.cfg |
| dna_r10.4.1_e8.2_260bps_sup@v4.1.0 | dna_r10.4.1_e8.2_260bps_sup@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_sup.cfg |
| dna_r10.4.1_e8.2_260bps_sup@v4.1.0 | dna_r10.4.1_e8.2_260bps_sup@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5hmc_5mc_cg_sup_prom.cfg |
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_fast.cfg |
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_fast_mk1c.cfg |
| dna_r10.4.1_e8.2_260bps_fast@v4.1.0 | dna_r10.4.1_e8.2_260bps_fast@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_fast_prom.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_hac.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_hac_mk1c.cfg |
| dna_r10.4.1_e8.2_260bps_hac@v4.1.0 | dna_r10.4.1_e8.2_260bps_hac@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_hac_prom.cfg |
| dna_r10.4.1_e8.2_260bps_sup@v4.1.0 | dna_r10.4.1_e8.2_260bps_sup@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_sup.cfg |
| dna_r10.4.1_e8.2_260bps_sup@v4.1.0 | dna_r10.4.1_e8.2_260bps_sup@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_260bps_modbases_5mc_cg_sup_prom.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.2.0 | dna_r10.4.1_e8.2_400bps_fast@v4.2.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_fast.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.2.0 | dna_r10.4.1_e8.2_400bps_fast@v4.2.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_fast_mk1c.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.2.0 | dna_r10.4.1_e8.2_400bps_fast@v4.2.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_fast_prom.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.2.0 | dna_r10.4.1_e8.2_400bps_hac@v4.2.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_hac.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.2.0 | dna_r10.4.1_e8.2_400bps_hac@v4.2.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_hac_mk1c.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.2.0 | dna_r10.4.1_e8.2_400bps_hac@v4.2.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_hac_prom.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.2.0 | dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_sup.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.2.0 | dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_sup_prom.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.2.0 | dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5mc_sup.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.2.0 | dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_5mc_sup_prom.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.2.0 | dna_r10.4.1_e8.2_400bps_sup@v4.2.0_6mA@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_6ma_sup.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.2.0 | dna_r10.4.1_e8.2_400bps_sup@v4.2.0_6mA@v2 | dna_r10.4.1_e8.2_400bps_5khz_modbases_6ma_sup_prom.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_fast.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_fast_mk1c.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_fast_prom.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_hac.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_hac_mk1c.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_hac_prom.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.1.0 | dna_r10.4.1_e8.2_400bps_sup@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_sup.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.1.0 | dna_r10.4.1_e8.2_400bps_sup@v4.1.0_5mCG_5hmCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5hmc_5mc_cg_sup_prom.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_fast.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_fast_mk1c.cfg |
| dna_r10.4.1_e8.2_400bps_fast@v4.1.0 | dna_r10.4.1_e8.2_400bps_fast@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_fast_prom.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac_mk1c.cfg |
| dna_r10.4.1_e8.2_400bps_hac@v4.1.0 | dna_r10.4.1_e8.2_400bps_hac@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac_prom.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.1.0 | dna_r10.4.1_e8.2_400bps_sup@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_sup.cfg |
| dna_r10.4.1_e8.2_400bps_sup@v4.1.0 | dna_r10.4.1_e8.2_400bps_sup@v3.5.2_5mCG@v2 | dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_sup_prom.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_e8_fast@v3.4_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_fast.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_e8_fast@v3.4_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_fast_mk1c.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_e8_fast@v3.4_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_fast_prom.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_e8_hac@v3.3_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_hac.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_e8_hac@v3.3_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_hac_mk1c.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_e8_hac@v3.3_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_hac_prom.cfg |
| dna_r9.4.1_e8_sup@v3.3 | dna_r9.4.1_e8_sup@v3.3_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_sup.cfg |
| dna_r9.4.1_e8_sup@v3.3 | dna_r9.4.1_e8_sup@v3.3_5mCG_5hmCG@v0 | dna_r9.4.1_450bps_modbases_5hmc_5mc_cg_sup_prom.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_e8_fast@v3.4_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_fast.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_e8_fast@v3.4_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_fast_mk1c.cfg |
| dna_r9.4.1_e8_fast@v3.4 | dna_r9.4.1_e8_fast@v3.4_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_fast_prom.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_e8_hac@v3.3_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_hac.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_e8_hac@v3.3_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_hac_mk1c.cfg |
| dna_r9.4.1_e8_hac@v3.3 | dna_r9.4.1_e8_hac@v3.3_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_hac_prom.cfg |
| dna_r9.4.1_e8_sup@v3.3 | dna_r9.4.1_e8_sup@v3.3_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_sup.cfg |
| dna_r9.4.1_e8_sup@v3.3 | dna_r9.4.1_e8_sup@v3.3_5mCG@v0.1 | dna_r9.4.1_450bps_modbases_5mc_cg_sup_prom.cfg |
hasindu2008 commented 1 month ago

@Psy-Fer Perhaps we can make a little script and put it in the repo under scripts/, and also have a page under docs/ where we copy-paste the above tables (and periodically update them).
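Something like this, perhaps (a rough sketch of a hypothetical scripts/dump-model-map.sh, just wrapping the one-liners above):

#!/bin/bash
# usage: dump-model-map.sh /path/to/ont-dorado-server/data
set -e
cd "$1"
echo "## nucleotide models"; echo
echo -e "|Dorado standalone model|Dorado server model|\n|---|---|"
grep "dorado_model_path" *.cfg | tr ':' '\t' | tr '=' '\t' | awk '{print "|"$3"|"$1"|"}' | grep -v "modbases\|duplex"
echo
echo "## modification models"; echo
echo -e "|Dorado standalone base model|Dorado standalone modmodel|Dorado server model|\n|---|---|---|"
for file in *modbases*.cfg ; do
    dorado_model=$(grep "dorado_model_path" "$file" | awk '{print $3}')
    dorado_modbase=$(grep "dorado_modbase_models" "$file" | awk '{print $3}')
    echo -e "|$dorado_model|$dorado_modbase|$file|"
done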

hasindu2008 commented 1 month ago

If I understand correctly, you might provide a table of the conversions between Dorado and Dorado-Server models. That would be very helpful, given the differences I have seen so far and those pointed out by both of you. For example, if I understand it correctly, dna_r10.4.1_e8.2_400bps_sup.cfg in ont-dorado-server 7.4.12 is actually the Dorado dna_r10.4.1_e8.2_400bps_sup@v4.3.0 model?

It is actually dna_r10.4.1_e8.2_400bps_5khz_sup.cfg that corresponds to dna_r10.4.1_e8.2_400bps_sup@v4.3.0; dna_r10.4.1_e8.2_400bps_sup.cfg is the v4.1.0 model (see the tables above).

SBurnard commented 1 month ago

Regarding your comments and reasons about dorado alignment not being supported by buttery-eel:

Both of your points make sense, particularly the argument for a 'modular approach' whereby each developer focuses on what they do best. I had assumed the integration of minimap2 alignment with dorado would run just after basecalling, but in some more efficient manner, i.e. while a read is already loaded in memory it could be quickly aligned. Therefore the benefits of it being supported in buttery-eel would be: 1) similar utility to dorado, so that others could more directly adapt scripts/pipelines that were used to process pod5 data, particularly for users that receive data in both formats; and 2) the output being BAM format (instead of SAM), which really helps reduce the storage size of temporary files. This could create a storage bottleneck for some users if they try to simultaneously process multiple nanopore runs. Not a dealbreaker by any means, just something users need to be conscious of, either ensuring large storage is available or planning for longer sequential processing.

That being said, I very much appreciate the extra work it would require to maintain this additional function. Furthermore, I was not aware of the potentially weird alignments caused by dorado - could you provide a link documenting some of this? I'll also keep this in mind, and will probably modify my old pod5 processing pipelines to implement minimap2 alignment after mod basecalling, just in case... That way they will match the buttery-eel pipeline I'm currently creating.

hasindu2008 commented 1 month ago

If I remember correctly, the discussions and investigations in these issues were eventually traced back to Dorado alignment: https://github.com/hasindu2008/f5c/issues/172 https://github.com/hasindu2008/f5c/issues/176 https://github.com/hasindu2008/f5c/issues/178

SBurnard commented 1 month ago

@hasindu2008 Those conversion tables look great!

~Am I correct in understanding, when we run buttery-eel on NCI (using the dorado server) it gets the dorado code from the ONT server, then uses the local config (.cfg) files along with the local remora/basecall models?~ I just checked the buttery_basecaller_logs and saw it read in the local config and model files.

And thanks, I'll have a look at those links above.

Psy-Fer commented 1 month ago

Do you think if I provided uBAM (unaligned) instead of uSAM this would mitigate some of this?

I can add that as a feature update.

SBurnard commented 1 month ago

uBAM as output could be useful, but only if minimap2 can handle BAM as input and retain all the mod flags. I know you've got uSAM working as input, but can that also work for uBAM?

Psy-Fer commented 1 month ago

It uses samtools fastq on either the uSAM or uBAM to pipe into minimap2. Use the -T flag in samtools and -y in minimap2 to carry the mod tags through the formats. That's already documented.
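For example, something like this (a sketch; MM,ML are the standard SAM base-modification tags, and ref.fa and the thread count are placeholders):

# carry the modification tags from the unaligned SAM/BAM through to an aligned, sorted BAM
samtools fastq -T MM,ML reads_unaligned.sam \
    | minimap2 -y -ax map-ont ref.fa - \
    | samtools sort -@ 8 -o reads_aligned.bam
samtools index reads_aligned.bam

Here -T MM,ML makes samtools fastq copy those tags into the FASTQ header comments, and -y makes minimap2 copy the comments back out as SAM tags.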

SBurnard commented 1 month ago

Grand! I had noticed that info, I just wasn't 100% sure if it would also work for BAM. In that case, uBAM as output would be useful to minimise the storage size of temp files. Do you anticipate this feature impacting computational time much?

Psy-Fer commented 1 month ago

Nah, though it will add a dependency. I'll add it to the list of new features.

hasindu2008 commented 1 month ago

I believe adding pysam as a dependency just for this purpose is not a great idea. The reason is that it will make installing buttery-eel more difficult for all users, irrespective of the availability of temporary storage space. My thinking is that if the users had space to store N parallel BLOW5/POD5 runs, then the extra size for the USAM files would not be that significant. Also, having it uncompressed means that samtools fastq piping to minimap would be faster and less CPU intensive. After the minimap2 step, the user can delete the USAM if they wish, or can use samtools to make it a UBAM if they are planning to archive. Heng Li is not adding htslib to minimap2 for a reason, and the same applies here. Those are my thoughts on this, and they are of course open for discussion.

Note: BLOW5 and SAM/BAM files do not need SSD space; they can simply be put on HDD-backed storage, like the scratch space on an HPC.

SBurnard commented 1 month ago

I believe adding pysam as a dependency just for this purpose is not a great idea. The reason is that it will make installing buttery-eel more difficult for all users.

Could you not make this an optional extra, i.e. if you want uBAM output you need to install samtools or pysam separately to make use of it?

My thinking is that if the users had space to store N parallel BLOW5/POD5 runs, then the extra size for the USAM files would not be that significant.

It could still be slightly limiting, even for multiple users on the same NCI project. For example, our group currently has 4.8 TB. One BLOW5 is 1.3 TB and produced a 719 GB SAM file, and it then still takes 3/4 of a day to create a 175 GB aligned BAM file (after which I can remove the SAM file). This means just one run takes up half the space, and it would be risky to even try a second. Meanwhile, other projects or team members may need space to run their own work. That being said, in light of the size of this data, I might ask NCI for more space. But I assume this could still be a valid issue for others who don't have access to large and flexible HPC systems...

Note: BLOW5 and SAM/BAM files do not need SSD space; they can simply be put on HDD-backed storage, like the scratch space on an HPC.

Very good point. We currently have 1 TB of scratch space, which would probably need to be expanded to safely support the temporary processing files for one run. I'll reach out to NCI to confirm how much it could be expanded for us...

uncompressed means that samtools fastq piping to minimap would be faster and less CPU intensive

That's useful information and a good argument for keeping uSAM as the input...

I guess the ultimate argument is that storage space is nowadays cheaper than increased computational power, and processing those binary files is somewhat more computationally demanding due to the need to decompress and recompress. However, as the size of these nanopore files keeps expanding (which is one reason I appreciate you developed the BLOW5 format), my concern is that you've developed a really useful, smaller data structure (SLOW5/BLOW5) to reduce bottlenecks and increase processing speed, only to require the output of a larger, less efficient file (albeit a potentially temporary one that keeps to a community-standard format). Ultimately, I believe minimap2 should more readily support binary BAM files as input in light of these ever-expanding datasets... That's my two cents anyway.

hasindu2008 commented 1 month ago

@SBurnard I think James will look into a way to implement this without adding a mandatory dependency.

Yes, your points about storage are taken. But another thing to note here is that, unlike a BLOW5 file which is going to be held long term, this USAM is a temporary file. So while the cost of the former can add up over years, the cost of temporary files that should only last a couple of hours is very small. On Gadi, locations such as /g/data are intended for long-term project storage and are not that cheap. That is the place where one would store the raw data and the final results, so the compact size is important. However, the scratch storage is free as far as I know, as it is for temporary use and things not accessed frequently are automatically purged. The NCI folks have so far been generous to us when large scratch storage spaces are requested, especially when we explain the size of the nanopore data and thus the need for such a large space. In this particular situation, the cost of this temporary file storage could be far less than the compute time for compressing and decompressing a file that is accessed just once.

To add to this further, if you are going to store the unaligned reads long term, it is indeed beneficial to store them as UBAM instead of USAM. In that case, the pipeline could have an extra samtools view command that converts the SAM on scratch to a BAM in /g/data.
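e.g. something as simple as this (paths and thread count are placeholders):

# compress the temporary USAM on scratch into a UBAM on project storage, then clean up
samtools view -b -@ 8 /scratch/project/reads_unaligned.sam -o /g/data/project/reads_unaligned.bam
rm /scratch/project/reads_unaligned.sam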

SBurnard commented 1 month ago

Hi @hasindu2008

Those are very good points, which I will adopt for our work (since we have access to NCI). Thank you!

My concern about the size of the temporary files was more for groups that don't have such ready access to expandable space, are more limited in that capacity, and would like to process more than one sample; or even groups that use the same hardware to both run and process samples, which are even more likely to rapidly fill up temporary storage space. I guess I'm advocating for an alternative option to accommodate the potential storage limitations of some users. That being said, if no one else has requested this feature, and since maintenance in line with ONT changes is demanding, this is fairly low priority (if it is even possible without dependency issues).

hasindu2008 commented 1 month ago

Yeah, I totally agree. @Psy-Fer will look into this when he gets some time, especially if more requests like this come in.