bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0
88 stars 18 forks source link

Bacpop 17 #201

Closed muppi1993 closed 2 years ago

muppi1993 commented 2 years ago

assign.py:

web.py:

muppi1993 commented 2 years ago

Note on sketch_to_hdf5(): Since the hd5 db only takes the attributes sketch_version, codon_phased and densified once at the top level group ("sketches"), I am not sure how we should handle cases, in which the different json_sketches might differ in these attributes. I guess this is most relevant for sketch_version, if the user combines new samples with some they uploaded a while ago (Not sure if codon_phased and densifiedwill ever be different?). Currently, they will be set to the values from the last json_sketch that is processed. I am not even sure if this is relevant at all?

muppi1993 commented 2 years ago
  • densified I don't expect this to be written to the database, is there an example in the code you can point me to where this is read/written?

densified is written to the db here: https://github.com/bacpop/PopPUNK/blob/46aff5d5715b26a7582c733d6956cb4c78748a99/PopPUNK/web.py#L194

Do we ever expect that this would be different for the sketches we generate with the wasm code? If we assume they will all have the same value here, it should not be a problem to just have this value once for the db containing multiple jsons generated at different times.

johnlees commented 2 years ago
  • densified I don't expect this to be written to the database, is there an example in the code you can point me to where this is read/written?

densified is written to the db here:

https://github.com/bacpop/PopPUNK/blob/46aff5d5715b26a7582c733d6956cb4c78748a99/PopPUNK/web.py#L194

Ah I see. I think that shouldn't be written to the DB as it's not used anywhere else. Basically this happens when the sketch size is 'too big' for the genome length, but it is an adaptation of the algorithm not a warning/error. In the sketching code we write to the terminal when this occurs for a sample, but this feels like something you'd only ever want to know from the CLI when you're creating a database manually.

So I'd suggest simply removing that line and ignoring that field (and we can remove it from the web sketch too, if necessary).

Do we ever expect that this would be different for the sketches we generate with the wasm code? If we assume they will all have the same value here, it should not be a problem to just have this value once for the db containing multiple jsons generated at different times.

For any bacterial data this should always be false (and definitely always false for S. pneumoniae), but the sketchlib code is set up to handle differences so I think you can just leave this as-is.

johnlees commented 2 years ago

@muppi1993 I'm happy for this to be merged if you bump the version btw