broadinstitute / longbow

Annotation and segmentation of MAS-seq data
https://broadinstitute.github.io/longbow/
BSD 3-Clause "New" or "Revised" License

Confusion over longbow correct #191

Closed sagnikbanerjee15 closed 1 year ago

sagnikbanerjee15 commented 1 year ago

Hello,

I am using longbow correct to fix cell barcodes. Our experiment has a single-cell component, and we ran Cell Ranger on the 10x data to obtain counts; its output includes a list of about 4.3K cell barcodes. The other barcode set available to us is the whitelist provided by 10x Genomics, which has over 6M barcodes. I ran the correction with each set, and I get unexpected results with the 4.3K-barcode set. In both cases I set the CCS max Hamming distance to 2. When I checked the output of longbow correct, some of the corrected barcodes were at a Hamming distance of more than 5 from the original. I am not sure why longbow corrects those barcodes instead of reporting them as uncorrectable. Here is an example:

Old BC: CGGACACAGTCGTTTA New BC: CCGGACACAGTCGTTA Hamming dist: 11

Please let me know which cell barcode list is recommended for correction.

Thank you

jamestwebber commented 1 year ago

longbow correct uses Levenshtein distance (a.k.a. edit distance) rather than Hamming distance. This means it counts insertions and deletions as well as mismatches.

So the example you gave is something like:

-CGGACACAGTCGTTTA
CCGGACACAGTCG-TTA

i.e. edit distance 2: one C inserted at the start and one T deleted near the end.
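To check the two numbers for the barcode pair above, here is a minimal Python sketch. The `hamming` and `levenshtein` functions are illustrative helpers written for this example; they are not part of longbow and are not necessarily how longbow computes the distance internally.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


old_bc = "CGGACACAGTCGTTTA"
new_bc = "CCGGACACAGTCGTTA"

print("Hamming:", hamming(old_bc, new_bc))          # Hamming: 11
print("Levenshtein:", levenshtein(old_bc, new_bc))  # Levenshtein: 2
```

A single insertion shifts every later base by one position, so the Hamming distance becomes large even though only two edits separate the barcodes.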

sagnikbanerjee15 commented 1 year ago

Makes sense. Thank you