calico / baskerville

Machine learning methods for DNA sequence analysis.
Apache License 2.0
24 stars 3 forks source link

Discrepancies Between Genome Sequences from BED Coordinates and tfrecords in Basenji Dataset #37

Open yangzhao1230 opened 1 week ago

yangzhao1230 commented 1 week ago

I have identified discrepancies between the sequences extracted from the hg38.ml.fa reference genome using BED file coordinates and those stored in the tfrecords within the Basenji dataset hosted on Google Cloud (https://console.cloud.google.com/storage/browser/basenji_barnyard?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22). This inconsistency is concerning as it affects the reliability of our data used for genomic analyses.

Detailed Observations:

  1. I retrieved sequences using the sequences.bed file from the hg38.ml.fa reference genome.

  2. Upon comparing these sequences with those recorded in the tfrecords dataset, I discovered mismatches. For example, in the validation set for human data, only two samples had sequences that matched their coordinates.

Questions:

  1. What could be causing these discrepancies between the retrieved sequences and those in the tfrecords?

  2. How is the hg38.ml.fa genome processed, and how does it differ from the standard reference genome?

Any assistance or guidance on how to verify the consistency of the data in the Google Cloud repository would be greatly appreciated.

yangzhao1230 commented 1 week ago

If it's convenient for you, could you please review the logic I used to test if the sequences are identical? My h5 files have all been converted from tfrecords, such that they should be exactly the same. I used the BED files and genome that you provided, but in the end, only two samples in the validation set had matching sequences.

import h5py
import torch
from torch.utils.data import Dataset
import numpy as np
from Bio import SeqIO
import pandas as pd
from tqdm import tqdm

ENFORMER_INPUT_LENGTH = 196_608
BASENJI_INPUT_LENGTH = 131_072

class H5Dataset(Dataset):
    def __init__(self, h5_file_path, genome_dict, df):
        self.h5_file = h5py.File(h5_file_path, 'r')
        self.sequences = self.h5_file['sequences']
        self.targets = self.h5_file['targets']
        self.genome_dict = genome_dict
        self.df = df

    def __del__(self):
        self.h5_file.close()

    @staticmethod
    def decode_one_hot(one_hot_encoded):
        """
        Decode one-hot encoded sequence to string.
        """
        mapping = np.array(['A', 'C', 'G', 'T', 'N'])
        indices = np.argmax(one_hot_encoded, axis=1)
        all_zeros = np.all(one_hot_encoded == 0, axis=1)
        indices[all_zeros] = 4 
        return ''.join(mapping[indices])

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):

        # Directly read sequences provided by Basenji
        sequence_h5 = self.decode_one_hot(self.sequences[idx])
        target = torch.tensor(self.targets[idx], dtype=torch.float32)

        # Retrieve sequence from genome with coordinates provided by Basenji
        row = self.df.iloc[idx]
        chrom, start, end = row["chrom"], row["start"], row["end"]
        sequence = str(self.genome_dict[chrom].seq[start:end]).upper()

        if sequence == sequence_h5:
            self.equal_count += 1
        else:
            self.unequal_count += 1

        # Extend sequence to match Enformer input length
        median = (start + end) // 2
        enformer_start = median - ENFORMER_INPUT_LENGTH // 2
        enformer_end = median + ENFORMER_INPUT_LENGTH // 2
        enformer_sequence = str(self.genome_dict[chrom].seq[enformer_start:enformer_end]).upper()

        return {
            'sequence': enformer_sequence,
            'target': target,
        }

if __name__ == '__main__':
    # Dataset paths
    bed_path = "/blob/Data/human/sequences.bed"
    genome_path = "/blob/Data/DNA/Caduceus/hg38/hg38.ml.fa"
    train_h5_path = "/home/aiscuser/data/enformer_h5/basenji/human_train.h5"
    valid_h5_path = "/home/aiscuser/data/enformer_h5/basenji/human_valid.h5"
    test_h5_path = "/home/aiscuser/data/enformer_h5/basenji/human_test.h5"
    # Load genome
    genome_dict = SeqIO.to_dict(SeqIO.parse(genome_path, "fasta"))
    # Load bed file
    df = pd.read_csv(bed_path, sep="\t", header=None)
    df.columns = ["chrom", "start", "end", "split"]
    train_df = df[df["split"] == "train"].reset_index(drop=True)
    valid_df = df[df["split"] == "valid"].reset_index(drop=True)
    test_df = df[df["split"] == "test"].reset_index(drop=True)
    # Load datasets
    train_dataset = H5Dataset(train_h5_path, genome_dict, train_df)
    valid_dataset = H5Dataset(valid_h5_path, genome_dict, valid_df)
    test_dataset = H5Dataset(test_h5_path, genome_dict, test_df)
    # Check if sequences are equal
    for i, sample in enumerate(tqdm(test_dataset)):
        targets = sample["target"]
        max_value = targets.max().item()
        min_value = targets.min().item()
        test_max = max(test_max, max_value)
        test_min = min(test_min, min_value)
    print(f"Equal count: {test_dataset.equal_count}, Unequal count: {test_dataset.unequal_count}")
davek44 commented 4 days ago

I'm not sure that I completely understand what you're doing here. But I can assure you the sequences in the tfrecords match the hg38/m10 reference genomes. My ".ml" versions only remove chromosomes that I don't want to train on. My best guess for why you're seeing a problem is that the tfrecords contain 131072 length sequences, but Ziga trained Enformer by further extending the sequences. If that's not it, I'd convert your sequences to DNA and blat via UCSC to track down the discrepancy.

yangzhao1230 commented 2 days ago

I'm not sure that I completely understand what you're doing here. But I can assure you the sequences in the tfrecords match the hg38/m10 reference genomes. My ".ml" versions only remove chromosomes that I don't want to train on. My best guess for why you're seeing a problem is that the tfrecords contain 131072 length sequences, but Ziga trained Enformer by further extending the sequences. If that's not it, I'd convert your sequences to DNA and blat via UCSC to track down the discrepancy.

Thank you for your quick response. I would like to ask if the one-hot encoded sequences stored in your TFRecord are consistent with the encoding format described in the Enformer paper?

I am not sure if I am decoding these one-hot vectors correctly (I am using the 131k bed file and TFRecord you provided for comparison, so the lengths are consistent). image

yangzhao1230 commented 2 days ago

I may have found the issue. It appears that the order of the data in the bed file does not correspond with the order in the TFRecords. For example, the second training data entry in the TFRecords corresponds to the 256th line in the bed file.