Open yangzhao1230 opened 1 week ago
If it's convenient for you, could you please review the logic I used to test if the sequences are identical? My h5 files have all been converted from tfrecords, such that they should be exactly the same. I used the BED files and genome that you provided, but in the end, only two samples in the validation set had matching sequences.
import h5py
import torch
from torch.utils.data import Dataset
import numpy as np
from Bio import SeqIO
import pandas as pd
from tqdm import tqdm
ENFORMER_INPUT_LENGTH = 196_608
BASENJI_INPUT_LENGTH = 131_072
class H5Dataset(Dataset):
def __init__(self, h5_file_path, genome_dict, df):
self.h5_file = h5py.File(h5_file_path, 'r')
self.sequences = self.h5_file['sequences']
self.targets = self.h5_file['targets']
self.genome_dict = genome_dict
self.df = df
def __del__(self):
self.h5_file.close()
@staticmethod
def decode_one_hot(one_hot_encoded):
"""
Decode one-hot encoded sequence to string.
"""
mapping = np.array(['A', 'C', 'G', 'T', 'N'])
indices = np.argmax(one_hot_encoded, axis=1)
all_zeros = np.all(one_hot_encoded == 0, axis=1)
indices[all_zeros] = 4
return ''.join(mapping[indices])
def __len__(self):
return len(self.targets)
def __getitem__(self, idx):
# Directly read sequences provided by Basenji
sequence_h5 = self.decode_one_hot(self.sequences[idx])
target = torch.tensor(self.targets[idx], dtype=torch.float32)
# Retrieve sequence from genome with coordinates provided by Basenji
row = self.df.iloc[idx]
chrom, start, end = row["chrom"], row["start"], row["end"]
sequence = str(self.genome_dict[chrom].seq[start:end]).upper()
if sequence == sequence_h5:
self.equal_count += 1
else:
self.unequal_count += 1
# Extend sequence to match Enformer input length
median = (start + end) // 2
enformer_start = median - ENFORMER_INPUT_LENGTH // 2
enformer_end = median + ENFORMER_INPUT_LENGTH // 2
enformer_sequence = str(self.genome_dict[chrom].seq[enformer_start:enformer_end]).upper()
return {
'sequence': enformer_sequence,
'target': target,
}
if __name__ == '__main__':
# Dataset paths
bed_path = "/blob/Data/human/sequences.bed"
genome_path = "/blob/Data/DNA/Caduceus/hg38/hg38.ml.fa"
train_h5_path = "/home/aiscuser/data/enformer_h5/basenji/human_train.h5"
valid_h5_path = "/home/aiscuser/data/enformer_h5/basenji/human_valid.h5"
test_h5_path = "/home/aiscuser/data/enformer_h5/basenji/human_test.h5"
# Load genome
genome_dict = SeqIO.to_dict(SeqIO.parse(genome_path, "fasta"))
# Load bed file
df = pd.read_csv(bed_path, sep="\t", header=None)
df.columns = ["chrom", "start", "end", "split"]
train_df = df[df["split"] == "train"].reset_index(drop=True)
valid_df = df[df["split"] == "valid"].reset_index(drop=True)
test_df = df[df["split"] == "test"].reset_index(drop=True)
# Load datasets
train_dataset = H5Dataset(train_h5_path, genome_dict, train_df)
valid_dataset = H5Dataset(valid_h5_path, genome_dict, valid_df)
test_dataset = H5Dataset(test_h5_path, genome_dict, test_df)
# Check if sequences are equal
for i, sample in enumerate(tqdm(test_dataset)):
targets = sample["target"]
max_value = targets.max().item()
min_value = targets.min().item()
test_max = max(test_max, max_value)
test_min = min(test_min, min_value)
print(f"Equal count: {test_dataset.equal_count}, Unequal count: {test_dataset.unequal_count}")
I'm not sure that I completely understand what you're doing here. But I can assure you the sequences in the tfrecords match the hg38/m10 reference genomes. My ".ml" versions only remove chromosomes that I don't want to train on. My best guess for why you're seeing a problem is that the tfrecords contain 131072 length sequences, but Ziga trained Enformer by further extending the sequences. If that's not it, I'd convert your sequences to DNA and blat via UCSC to track down the discrepancy.
I'm not sure that I completely understand what you're doing here. But I can assure you the sequences in the tfrecords match the hg38/m10 reference genomes. My ".ml" versions only remove chromosomes that I don't want to train on. My best guess for why you're seeing a problem is that the tfrecords contain 131072 length sequences, but Ziga trained Enformer by further extending the sequences. If that's not it, I'd convert your sequences to DNA and blat via UCSC to track down the discrepancy.
Thank you for your quick response. I would like to ask if the one-hot encoded sequences stored in your TFRecord are consistent with the encoding format described in the Enformer paper?
I am not sure if I am decoding these one-hot vectors correctly (I am using the 131k bed file and TFRecord you provided for comparison, so the lengths are consistent).
I may have found the issue. It appears that the order of the data in the bed file does not correspond with the order in the TFRecords. For example, the second training data entry in the TFRecords corresponds to the 256th line in the bed file.
I have identified discrepancies between the sequences extracted from the
hg38.ml.fa
reference genome using BED file coordinates and those stored in the tfrecords within the Basenji dataset hosted on Google Cloud (https://console.cloud.google.com/storage/browser/basenji_barnyard?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22). This inconsistency is concerning as it affects the reliability of our data used for genomic analyses.Detailed Observations:
I retrieved sequences using the
sequences.bed
file from thehg38.ml.fa
reference genome.Upon comparing these sequences with those recorded in the
tfrecords
dataset, I discovered mismatches. For example, in the validation set for human data, only two samples had sequences that matched their coordinates.Questions:
What could be causing these discrepancies between the retrieved sequences and those in the
tfrecords
?How is the
hg38.ml.fa
genome processed, and how does it differ from the standard reference genome?Any assistance or guidance on how to verify the consistency of the data in the Google Cloud repository would be greatly appreciated.