buenrostrolab / slide_dna_seq_analysis

Analysis of slide-DNA-seq data in Zhao, Chiang, et al.

Correct list of barcodes #8

Closed shashwatsahay closed 10 months ago

shashwatsahay commented 1 year ago

Hi @zchiang

Sorry for being annoying. I asked for the complete list of bead barcodes in issue #7, but the barcode list that was sent does not match nearly 90% of the barcodes that were provided. Could you please recheck whether the barcodes provided were correct?

I am providing screenshots and the complete code from my Jupyter notebook showing how I arrived at the conclusion that something went wrong.



import numpy as np
import pandas as pd

# Read the full barcode list, stripping the commas within each line.
full_beadfile='slide_dna_seq_analysis/data/human_colon_cancer_3_dna/full_BeadBarcodes.txt'
beads=list()
with open(full_beadfile) as handle:
    for line in handle:
        beads.append(line.strip().replace(',', ''))

# Read the full location list and transpose it into x and y columns.
full_beadlocation='slide_dna_seq_analysis/data/human_colon_cancer_3_dna/full_BeadLocations.txt'
with open(full_beadlocation) as handle:
    all_coords=pd.DataFrame(np.array([[float(i) for i in line.split(',')] for line in handle]).T, columns=['x', 'y'])

all_coords['barcodes']=beads

# Bead locations file provided in the repository.
coords_file=pd.read_csv('slide_dna_seq_analysis/data/human_colon_cancer_3_dna/human_colon_cancer_3_dna_191204_19.bead_locations.csv')

# Barcodes from each list that also appear in the other.
coords_file[coords_file.barcodes.isin(all_coords.barcodes)]

all_coords[all_coords.barcodes.isin(coords_file.barcodes)]

shashwatsahay commented 1 year ago

Also, could you check the link provided in issue #1 for the H&E stain? It seems to be broken.

It would also be great if the same list of all barcodes could be provided for all samples.

Thanks :)

zchiang commented 11 months ago

Have you tried reverse complementing the barcodes? A quick check seems to indicate that that works for me, happy to debug further if that doesn't solve your issue.

shashwatsahay commented 11 months ago

Yes, I tried that as well, but still couldn't get it to work.

I took the reverse complement code from the getcounts.py file.

import numpy as np
import pandas as pd

# Read the full barcode list, stripping the commas within each line.
full_beadfile='slide_dna_seq_analysis/data/human_colon_cancer_3_dna/full_BeadBarcodes.txt'
beads=list()
with open(full_beadfile) as handle:
    for line in handle:
        beads.append(line.strip().replace(',', ''))

# Read the full location list and transpose it into x and y columns.
full_beadlocation='slide_dna_seq_analysis/data/human_colon_cancer_3_dna/full_BeadLocations.txt'
with open(full_beadlocation) as handle:
    all_coords=pd.DataFrame(np.array([[float(i) for i in line.split(',')] for line in handle]).T, columns=['x', 'y'])

all_coords['barcodes']=beads

# Bead locations file provided in the repository.
coords_file=pd.read_csv('slide_dna_seq_analysis/data/human_colon_cancer_3_dna/human_colon_cancer_3_dna_191204_19.bead_locations.csv')

# Barcodes from each list that also appear in the other.
coords_file[coords_file.barcodes.isin(all_coords.barcodes)]

all_coords[all_coords.barcodes.isin(coords_file.barcodes)]

complement = {"A":"T", "C":"G", "G":"C", "T":"A", "N": "N"}

def reverse_complement(seq):
    # Complement each base while walking the sequence back to front.
    return "".join(complement[base] for base in reversed(seq))

all_coords['rev_comp_barcodes']=all_coords['barcodes'].apply(reverse_complement)

# Repeat the comparison against the reverse-complemented barcodes.
coords_file[coords_file.barcodes.isin(all_coords.rev_comp_barcodes)]


shashwatsahay commented 11 months ago

Hey @zchiang

Any updates on the barcode matching?

Also, it would be great if you could upload the H&E stain for Fig. 3 as well. Thanks

shashwatsahay commented 10 months ago

Hey @zchiang

Sorry for the repeated pings, but any luck?

zchiang commented 10 months ago

Hi @shashwatsahay, thanks for your patience. We had to go back pretty far in our archival records to figure this out, but I think we have the correct files now.

I've uploaded the lists of extended barcodes and spatial locations here: https://drive.google.com/drive/folders/18jkSgXmMED_4dFId9IWze7TzbUrGje2C?usp=drive_link

The matching between the samples in the paper is as follows:

mouse_cerebellum_1_dna_200114_14 -> 191118_13
mouse_liver_met_1_dna_191114_06 -> 191026_06
mouse_liver_met_1_dna_191114_05 -> 191026_05
mouse_liver_met_2_dna_200114_10 -> 191118_10
mouse_liver_met_2_rna_200102_04 -> 200102_04
human_colon_cancer_3_dna_191204_19 -> 191026_19
human_colon_cancer_4_dna_200114_13 -> 191118_13
human_colon_cancer_4_rna_200102_06 -> 200102_06

For the slide-DNA samples (191026 and 191118), the barcodes will have to be rearranged in the following order (1 indexed): [2 7 1 6 5 4 3 9 14 8 13 12 11 10]

Doing so will produce a longer list of barcodes/locations that is analogous to the original bead locations files provided, so to match them to the BAM files you will have to reverse complement them.
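
A minimal sketch of the two steps above (rearranging the bases, then reverse complementing), in case it helps. It assumes the order is applied MATLAB-style, so that position i of the new barcode takes the base at the listed 1-indexed position of the original; if the intended meaning is the inverse mapping, the permutation would need to be inverted. The function names and the example barcode are made up for illustration and are not from the repository.

```python
# 1-indexed base order quoted in this thread for the slide-DNA samples.
ORDER_1_INDEXED = [2, 7, 1, 6, 5, 4, 3, 9, 14, 8, 13, 12, 11, 10]

# Translation table for complementing bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reorder_barcode(barcode, order=ORDER_1_INDEXED):
    """Rearrange bases so that new[i] = old[order[i] - 1].

    Assumption: the order indexes into the original barcode (1-indexed).
    """
    if len(barcode) != len(order):
        raise ValueError(
            f"expected a barcode of length {len(order)}, got {len(barcode)}"
        )
    return "".join(barcode[i - 1] for i in order)


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Made-up 14-base barcode, only to show the call sequence.
example = "ACGTACGTACGTAC"
reordered = reorder_barcode(example)
bam_style = reverse_complement(reordered)
print(reordered, bam_style)
```

With a pandas column of barcodes, the same functions can be applied through `Series.apply` before comparing against the barcodes from the BAM files.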

Lastly, when matching barcodes to the BAMs, we typically use a Hamming distance filter of 1 or 2. Additionally, it's known that the last few in situ sequenced bases on the array (e.g. bases 11 and 10 in the barcode) are of lower quality, so you may have to experiment with excluding them to get maximal matching.
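
A rough sketch of Hamming-distance matching with optional masking of low-quality positions. This is not the code used for the paper: the function names, the `skip` parameter (0-based positions to ignore), and the choice to reject ambiguous matches are all assumptions made for illustration. Which positions to mask depends on the barcode orientation at the point of comparison, so that is left to the caller.

```python
def hamming_distance(a, b, skip=()):
    """Count mismatching positions between two equal-length strings,
    ignoring any 0-based positions listed in `skip`."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for i, (x, y) in enumerate(zip(a, b)) if i not in skip)


def match_barcode(query, references, max_dist=1, skip=()):
    """Return the single closest reference within `max_dist` of `query`.

    Returns None when nothing is close enough or when two references tie
    for the best distance (an assumed policy, to avoid ambiguous calls).
    """
    best, best_dist, tie = None, max_dist + 1, False
    for ref in references:
        d = hamming_distance(query, ref, skip)
        if d < best_dist:
            best, best_dist, tie = ref, d, False
        elif d == best_dist and d <= max_dist:
            tie = True
    return None if tie else best


# Toy reference list, only to show usage.
refs = ["AAAAAA", "CCCCCC", "AAAATT"]
print(match_barcode("AAAAAC", refs, max_dist=1))
```

This brute-force loop compares every query against every reference, so for full bead lists it will be slow; it is meant only to make the filter concrete.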

zchiang commented 10 months ago

Oh, and the human colon cancer H&E uploaded is the one featured in both Fig. 3 and 4.