Closed tacazares closed 3 years ago
While running the quant models we were running into with multiple errors.
1) Differences in batch sizes between signal and target arrays
ValueError: Input arrays should have the same number of samples as target arrays. Found 1000 input samples and 996 target samples.
2) Invalid interval bound error
raise ValueError("None values not supported.")
ValueError: None values not supported.
The cause seems to be that the quantitative data was mapped to the hg19 reference genome and an older version of the data. The new quantitative data will be mapped to hg38 like our ATAC-seq data. This error was discovered using the following function to loop through each ROI in the ROI files and print the regions that were accepted:
def get_ATAC_CHIP_scores(row_idx, roi_df):
roi_row = roi_df.iloc[row_idx, :]
cell_line = roi_row['Cell_Line']
chrom = roi_row['Chr']
start = int(roi_row['Start'])
end = int(roi_row['Stop'])
meta_row = meta_table[(meta_table['Cell_Line'] == cell_line)]
meta_row = meta_row.reset_index(drop=True)
signal = meta_row.loc[0, 'ATAC_Signal_File']
binding = meta_row.loc[0, 'Binding_File']
with load_bigwig(signal) as signal_stream, load_bigwig(binding) as binding_stream:
signal_array = np.array(signal_stream.values(chrom, start, end))
target_array = np.array(binding_stream.values(chrom, start, end))
return signal_array, target_array
I was then able to verify that the ChIP-seq was the source of the problem by opening up the ATAC and ChIP-seq files individually using an example region that failed.
# Load bigwig file
hepg2_chip = load_bigwig("CTCF_HepG2_RPM_log_minmax01.bw")
# Import array as numpy vector
binding_array = np.array(hepg2_chip.values("chr20", 64181767, 64182791))
which produces:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-35-2b6014d786fe> in <module>
----> 1 binding_array = np.array(hepg2_chip.values("chr20", 64181767, 64182791))
RuntimeError: Invalid interval bounds!
I then checked to see if the chromosome sizes where the same in each bigwig file. Bigwig files contain a dictionary of chromosome sizes in their header.
ATAC
hepg2_atac.chroms()
{'chr1': 248956422,
'chr10': 133797422,
'chr11': 135086622,
'chr12': 133275309,
'chr13': 114364328,
'chr14': 107043718,
'chr15': 101991189,
'chr16': 90338345,
'chr17': 83257441,
'chr18': 80373285,
'chr19': 58617616,
'chr2': 242193529,
'chr20': 64444167,
'chr21': 46709983,
'chr22': 50818468,
'chr3': 198295559,
'chr4': 190214555,
'chr5': 181538259,
'chr6': 170805979,
'chr7': 159345973,
'chr8': 145138636,
'chr9': 138394717,
'chrX': 156040895}
ChIP
hepg2_chip.chroms()
{'chr1': 249250621,
'chr10': 135534747,
'chr11': 135006516,
'chr12': 133851895,
'chr13': 115169878,
'chr14': 107349540,
'chr15': 102531392,
'chr16': 90354753,
'chr17': 81195210,
'chr18': 78077248,
'chr19': 59128983,
'chr2': 243199373,
'chr20': 63025520,
'chr21': 48129895,
'chr22': 51304566,
'chr3': 198022430,
'chr4': 191154276,
'chr5': 180915260,
'chr6': 171115067,
'chr7': 159138663,
'chr8': 146364022,
'chr9': 141213431,
'chrX': 155270560}
I was then able to validate that the ChIP-seq data was reference hg19 chrom sizes and the ATAC-seq data references hg38 chrom sizes.
The second error concerning invalid interval bounds caused the first error. At some point the error was noticed and extra code was added to skip the problematic regions. Example below:
try:
input_matrix = get_pc_input_matrix(
rows=INPUT_CHANNELS,
cols=INPUT_LENGTH,
batch_size=1, # we will combine into batch later
reshape=False,
bp_order=BP_ORDER,
signal_stream=signal_stream,
average_stream=average_stream,
sequence_stream=sequence_stream,
chrom=chrom_name,
start=start,
end=end
)
inputs_batch.append(input_matrix)
if not quant:
target_vector = np.array(binding_stream.values(chrom_name, start, end)).T
target_vector = np.nan_to_num(target_vector, 0.0)
n_bins = int(target_vector.shape[0] / bp_resolution)
split_targets = np.array(np.split(target_vector, n_bins, axis=0))
bin_sums = np.sum(split_targets, axis=1)
bin_vector = np.where(bin_sums > 0.5*bp_resolution, 1.0, 0.0)
targets_batch.append(bin_vector)
else:
target_vector = np.array(binding_stream.values(chrom_name, start, end)).T
target_vector = np.nan_to_num(target_vector, 0.0)
n_bins = int(target_vector.shape[0] / bp_resolution)
split_targets = np.array(np.split(target_vector, n_bins, axis=0))
bin_vector =np.mean(split_targets, axis=1) #Perhaps we can change np.mean to np.median. Something to think about.
targets_batch.append(bin_vector)
except:
here = 3
print(roi_row)
continue
We are are going to generate the new quant data using hg38 reference.
While generating quantitative data I found a bug in the normalization code. The parser did not include an argument for user defined chroms, so that feature was added in.
Original
with pyBigWig.open(args.signal) as input_bw, pyBigWig.open(OUTPUT_FILENAME, "w") as output_bw:
header = [(x, chromosome_length_dictionary[x]) for x in sorted(DEFAULT_CHROMS)]
output_bw.addHeader(header)
# Create a status bar for to look fancy and count what chromosome you are on
chrom_status_bar = tqdm.tqdm(total=len(DEFAULT_CHROMS), desc='Chromosomes Processed', position=0)
Modified
with pyBigWig.open(args.signal) as input_bw, pyBigWig.open(OUTPUT_FILENAME, "w") as output_bw:
header = [(x, chromosome_length_dictionary[x]) for x in sorted(args.chroms)]
output_bw.addHeader(header)
# Create a status bar for to look fancy and count what chromosome you are on
chrom_status_bar = tqdm.tqdm(total=len(args.chroms), desc='Chromosomes Processed', position=0)
The raw counts data outperformed the log normalized quant data in every metric. There might need to be some adjustments, but we are going to use RP20M_minmax01 counts from now on for quant data.
We want to test different window (slop) sizes around the Tn5 cut site and its effects on binary and quantitative prediction AUPR and R^2.
We will use 11 TFs and MACS2 peaks based on the slop20b- ATAC-seq peaks.
Slop sizes 0, 5, 10, 20, 40
Binary Tasks
Quant Tasks