MasaakiU / MultiplexNanopore

Other
10 stars 2 forks source link

KeyError while calculating consensus #1

Closed zsilfen closed 1 year ago

zsilfen commented 1 year ago

Hi there,

I've been running into a KeyError when trying to put together a consensus sequence. Below is the error message:

refseq No. 0 6551 out of 7322:243: RuntimeWarning: invalid value encountered in double_scalars val *= self.calc_P_event_given_true_refseq(event, base) / bunshi

KeyError Traceback (most recent call last) in <cell line: 388>() 386 return alignment_result_2 387 --> 388 alignment_result_2 = calculate_consensus(alignment_result, param_dict)

2 frames in mixed_bases(base_list) 83 letters += b 84 print(letters) ---> 85 return letter_code_dict[letters] 86 87 def P_N_dict_dict_2_matrix(P_N_dict_dict, bases=bases):

KeyError: ''

MasaakiU commented 1 year ago

Hello,

Thank you for using SAVEMONEY. Let me ask you some questions:

If you do not want to upload your dataset on this thread, could you personally send it to me (mu84@cornell.edu)? I will not reveal any information and delete your data after solving this issue.

Thank you.

zsilfen commented 1 year ago

Sent a follow-up via email with data!

MasaakiU commented 1 year ago

The bug was fixed.

The problem was originated from not considering the case where some reads contain base calling with q-score of 1. I failed to consider such situation because the provider I am using (plasmidsaurus) always returns q-score of at least 2. The probability of returning a base call with a q-score of 1 was considered zero in the program, so it could not handle the case where it was included in the actual reads. Setting a very small pseudo probability when q-score is 1 solved the problem.

Thank you @zsilfen for pointing out the issues!

jyelland commented 4 months ago

Hi @MasaakiU,

I am also getting a key error during MSA. Based on a question you asked above, I think it is possibly because my plasmid maps contain "N" (I want to find the actual sequence using SAVEMONEY). Is there a workaround or a fix that allows N bases in the plasmid maps?

I am working on Ubuntu in a conda environment, calling savemoney.post_analysis from the command line after installing with pip. Error log is attached here: log.txt

Thanks for the software and for your help.

MasaakiU commented 4 months ago

Hello Jamie, Thank you for contacting me again and sorry for the inconvenience. Support of non-ATCG letters should be something that can be done without too much effort, and I'll definitely include the feature in the next update. I can probably do it this weekends, but if you're in a hurry you can replace "N" with (for example) "A" and just ignore "with_prior" results. "without_prior" results will tell you the correct base if the length of consecutive "N" is not too large, although the reliability will slightly decrease (because it does not use prior information during the base calling). Let me know if you have further questions!

jyelland commented 4 months ago

Thanks, @MasaakiU! The workaround you suggest sounds like it will work in the meantime, and I'll give it a try.

Hmm, well there are indeed a number of consecutive Ns because these are Nx20 barcodes, which I'd like to demultiplex using SAVEMONEY. I'll let you know how this works. I can start another issue if it's better discussed separately.

MasaakiU commented 4 months ago

Hello Jamie, Sorry for the delay! Now SAVEMONEY (ver_0.3.1) supports degenerate base symbols (A, T, C, G, K, M, R, S, W, Y, B, V, D, H, N) in plasmid maps.

If your sample contains only one type of Nx20 barcodes, SAVEMONEY should be able to tell you the barcode sequence. If you are mixing multiple barcoded samples, it may be difficult to demultiplex them by SAVEMONEY without the prior knowledge of barcodes.

However, there may be a dirty workaround, so please let me know in more details if you want to discuss this. As you wrote, it may be better to start the new issue and discuss separately!

jyelland commented 4 months ago

@MasaakiU Wow, this looks great! Very exciting to see support for linear DNA too.

The workaround you suggested above worked really well. We usually mix samples that have random barcodes, but with a unique identifying feature (e.g. a sgRNA) in each pool. So it seems like SAVEMONEY can handle that pretty well.

I am looking forward to incorporating the new version into future pipelines. I'll let you know how that goes and will open new issues as they arise :)

MasaakiU commented 4 months ago

Hi Jamie, I see, that's make sense! The sequence of sgRNA etc. should be sufficient for the classification of reads even if the barcode sequences are totally unknown. I'm happy that my package is helpful for your project, and the best of luck with your pipelines and experiments!