Confusion over longbow correct

sagnikbanerjee15 commented 2 years ago

Hello,

I am using longbow correct to fix cell barcodes. Our experiment has a single-cell component, and we used cell-ranger to obtain counts from the 10X data, which outputs a list of cell barcodes. In our case this list contains around 4.3K barcodes. The other set of cell barcodes at our disposal is the whitelist provided by 10X Genomics, which has over 6M barcodes. I ran correction with each set, but I get unexpected results with the 4.3K barcode set. In both cases I set the CCS max hamming distance to 2. When I checked the output of longbow correct, I found that some of the corrected barcodes are at a hamming distance of more than 5 from the original. I am not sure why longbow corrects those barcodes instead of reporting them as uncorrectable. Here is an example:

Old BC: CGGACACAGTCGTTTA New BC: CCGGACACAGTCGTTA Hamming dist: 11

Please let me know which cell barcode list is recommended for correction.

Thank you

jamestwebber commented 2 years ago

longbow correct uses Levenshtein distance (a.k.a. edit distance) rather than Hamming distance. This means it counts insertions and deletions as well as mismatches.

So the example you gave is something like:

-CGGACACAGTCGTTTA
CCGGACACAGTCG-TTA

i.e. one insertion at the start and one deletion near the end, so edit distance 2
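
To check both numbers for this barcode pair, here is a minimal standalone Python sketch. It is not longbow's own code: the function names `hamming` and `levenshtein` are only illustrative, and the edit distance is a textbook dynamic-programming version that may differ from whatever longbow uses internally.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


old_bc = "CGGACACAGTCGTTTA"
new_bc = "CCGGACACAGTCGTTA"

print("Hamming:", hamming(old_bc, new_bc))          # 11
print("Levenshtein:", levenshtein(old_bc, new_bc))  # 2
```

A single insertion near the start shifts every following base by one position, which is why the Hamming distance is so large while the edit distance stays at 2.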

sagnikbanerjee15 commented 2 years ago

Makes sense. Thank you

broadinstitute / longbow

Confusion over longbow correct #191