MiraldiLab / maxATAC

Transcription Factor Binding Prediction from ATAC-seq and scATAC-seq with Deep Neural Networks
Apache License 2.0

Test different Tn5 signal window sizes with binary and quantitative models #8

Closed tacazares closed 3 years ago

tacazares commented 3 years ago

We want to test different window (slop) sizes around the Tn5 cut site and their effect on binary prediction (AUPR) and quantitative prediction (R^2).

We will use 11 TFs and MACS2 peaks based on the slop20bp ATAC-seq peaks.

Slop sizes: 0, 5, 10, 20, and 40 bp (a window-generation sketch follows the task list below)

Binary Tasks

Quant Tasks
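As a reference for the window generation, a minimal sketch using pybedtools; the input file name is hypothetical, and the slop sizes match the list above:

    import pybedtools

    # Hypothetical BED file of single-bp Tn5 cut sites
    cut_sites = pybedtools.BedTool("tn5_cut_sites.bed")

    for slop in [0, 5, 10, 20, 40]:
        # Extend each cut site by `slop` bp on both sides; genome="hg38"
        # supplies chrom sizes so intervals are clipped at chromosome ends
        windows = cut_sites.slop(b=slop, genome="hg38")
        windows.saveas(f"tn5_cut_sites_slop{slop}bp.bed")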

tacazares commented 3 years ago

While running the quant models we ran into multiple errors.

1) Differences in batch sizes between signal and target arrays

ValueError: Input arrays should have the same number of samples as target arrays. Found 1000 input samples and 996 target samples.

2) Invalid interval bound error

    raise ValueError("None values not supported.")
ValueError: None values not supported.

The cause seems to be that the quantitative data was mapped to the hg19 reference genome, i.e., an older version of the data. The new quantitative data will be mapped to hg38, like our ATAC-seq data. This error was discovered using the following function to loop through each ROI in the ROI files and print the regions that were accepted:

import numpy as np

def get_ATAC_CHIP_scores(row_idx, roi_df):
    """Return the ATAC signal and ChIP-seq target arrays for a single ROI.

    Assumes `meta_table` (cell line -> bigwig paths) and the bigwig helper
    `load_bigwig` are available in the surrounding module.
    """
    roi_row = roi_df.iloc[row_idx, :]

    # Region of interest coordinates
    cell_line = roi_row['Cell_Line']
    chrom = roi_row['Chr']
    start = int(roi_row['Start'])
    end = int(roi_row['Stop'])

    # Look up the ATAC signal and ChIP-seq binding files for this cell line
    meta_row = meta_table[(meta_table['Cell_Line'] == cell_line)]
    meta_row = meta_row.reset_index(drop=True)

    signal = meta_row.loc[0, 'ATAC_Signal_File']
    binding = meta_row.loc[0, 'Binding_File']

    with load_bigwig(signal) as signal_stream, load_bigwig(binding) as binding_stream:
        signal_array = np.array(signal_stream.values(chrom, start, end))
        target_array = np.array(binding_stream.values(chrom, start, end))

    return signal_array, target_array
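A sketch of the driver loop described above, assuming roi_df is the loaded ROI dataframe; pyBigWig raises RuntimeError for regions outside a file's chrom sizes, so catching it lets us print which regions pass and which fail:

    for row_idx in range(roi_df.shape[0]):
        roi_row = roi_df.iloc[row_idx]
        region = f"{roi_row['Chr']}:{roi_row['Start']}-{roi_row['Stop']}"
        try:
            signal_array, target_array = get_ATAC_CHIP_scores(row_idx, roi_df)
            print(f"accepted: {region}")
        except RuntimeError as err:
            print(f"rejected: {region} ({err})")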

I was then able to verify that the ChIP-seq file was the source of the problem by opening the ATAC-seq and ChIP-seq files individually and querying an example region that failed.

# Load bigwig file
hepg2_chip = load_bigwig("CTCF_HepG2_RPM_log_minmax01.bw")

# Import array as numpy vector
binding_array = np.array(hepg2_chip.values("chr20", 64181767, 64182791))

which produces:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-35-2b6014d786fe> in <module>
----> 1 binding_array = np.array(hepg2_chip.values("chr20", 64181767, 64182791))

RuntimeError: Invalid interval bounds!
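The header explains the failure: chr20 in this ChIP-seq file is shorter than the requested end coordinate. pyBigWig's chroms() accepts a chromosome name and returns its length, so the bounds can be checked directly:

    chr20_len = hepg2_chip.chroms("chr20")  # 63025520 (hg19) in this file

    start, end = 64181767, 64182791
    print(end <= chr20_len)  # False -> RuntimeError: Invalid interval bounds!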

I then checked to see if the chromosome sizes were the same in each bigwig file. Bigwig files contain a dictionary of chromosome sizes in their header.

ATAC

hepg2_atac.chroms()

{'chr1': 248956422,
 'chr10': 133797422,
 'chr11': 135086622,
 'chr12': 133275309,
 'chr13': 114364328,
 'chr14': 107043718,
 'chr15': 101991189,
 'chr16': 90338345,
 'chr17': 83257441,
 'chr18': 80373285,
 'chr19': 58617616,
 'chr2': 242193529,
 'chr20': 64444167,
 'chr21': 46709983,
 'chr22': 50818468,
 'chr3': 198295559,
 'chr4': 190214555,
 'chr5': 181538259,
 'chr6': 170805979,
 'chr7': 159345973,
 'chr8': 145138636,
 'chr9': 138394717,
 'chrX': 156040895}

ChIP

hepg2_chip.chroms()

{'chr1': 249250621,
 'chr10': 135534747,
 'chr11': 135006516,
 'chr12': 133851895,
 'chr13': 115169878,
 'chr14': 107349540,
 'chr15': 102531392,
 'chr16': 90354753,
 'chr17': 81195210,
 'chr18': 78077248,
 'chr19': 59128983,
 'chr2': 243199373,
 'chr20': 63025520,
 'chr21': 48129895,
 'chr22': 51304566,
 'chr3': 198022430,
 'chr4': 191154276,
 'chr5': 180915260,
 'chr6': 171115067,
 'chr7': 159138663,
 'chr8': 146364022,
 'chr9': 141213431,
 'chrX': 155270560}

I was then able to validate that the ChIP-seq data references the hg19 chrom sizes while the ATAC-seq data references the hg38 chrom sizes.
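A quick programmatic check makes the mismatch obvious; every shared chromosome has a different length, which is the signature of two different genome builds:

    atac_chroms = hepg2_atac.chroms()
    chip_chroms = hepg2_chip.chroms()

    # Chromosomes whose lengths disagree between the two headers
    mismatched = {c: (atac_chroms[c], chip_chroms.get(c))
                  for c in atac_chroms
                  if chip_chroms.get(c) != atac_chroms[c]}
    print(len(mismatched))  # 23 -> every chromosome differs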

The second error, the invalid interval bounds, is what caused the first. At some point the error was noticed and extra code was added to skip the problematic regions: the input matrix is appended before the target read fails, so the bare except leaves inputs_batch with more samples than targets_batch (1000 vs. 996). Example below:

                    try:
                        input_matrix = get_pc_input_matrix(
                            rows=INPUT_CHANNELS,
                            cols=INPUT_LENGTH,
                            batch_size=1,                  # we will combine into batch later
                            reshape=False,
                            bp_order=BP_ORDER,
                            signal_stream=signal_stream,
                            average_stream=average_stream,
                            sequence_stream=sequence_stream,
                            chrom=chrom_name,
                            start=start,
                            end=end
                        )
                        # Appended before the target read, so a failure below desyncs the batches
                        inputs_batch.append(input_matrix)
                        if not quant:
                            target_vector = np.array(binding_stream.values(chrom_name, start, end)).T
                            target_vector = np.nan_to_num(target_vector, 0.0)
                            n_bins = int(target_vector.shape[0] / bp_resolution)
                            split_targets = np.array(np.split(target_vector, n_bins, axis=0))
                            bin_sums = np.sum(split_targets, axis=1)
                            bin_vector = np.where(bin_sums > 0.5 * bp_resolution, 1.0, 0.0)
                            targets_batch.append(bin_vector)
                        else:
                            target_vector = np.array(binding_stream.values(chrom_name, start, end)).T
                            target_vector = np.nan_to_num(target_vector, 0.0)
                            n_bins = int(target_vector.shape[0] / bp_resolution)
                            split_targets = np.array(np.split(target_vector, n_bins, axis=0))
                            bin_vector = np.mean(split_targets, axis=1)  # perhaps np.mean could be np.median; something to think about
                            targets_batch.append(bin_vector)

                    except:  # bare except silently swallows the "Invalid interval bounds!" RuntimeError
                        here = 3  # leftover debug marker
                        print(roi_row)
                        continue

We are going to generate the new quant data using the hg38 reference.
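Until the regenerated data is in place, a defensive pre-filter could reject out-of-bounds ROIs loudly instead of letting a bare except hide them. A minimal sketch, assuming the pyBigWig-backed streams used above (filter_valid_rois is a hypothetical helper, not the project's code):

    def filter_valid_rois(roi_df, signal_stream, binding_stream):
        """Keep only ROIs that fit inside both bigwigs' chrom sizes."""
        def in_bounds(row):
            chrom, end = row["Chr"], int(row["Stop"])
            return (chrom in signal_stream.chroms()
                    and chrom in binding_stream.chroms()
                    and end <= signal_stream.chroms(chrom)
                    and end <= binding_stream.chroms(chrom))

        keep = roi_df.apply(in_bounds, axis=1)
        if (~keep).any():
            print(f"Dropping {(~keep).sum()} out-of-bounds ROIs:")
            print(roi_df[~keep])
        return roi_df[keep]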

tacazares commented 3 years ago

While generating quantitative data, I found a bug in the normalization code: the parser did not include an argument for user-defined chroms, so that feature was added.

Original

    with pyBigWig.open(args.signal) as input_bw, pyBigWig.open(OUTPUT_FILENAME, "w") as output_bw:
        header = [(x, chromosome_length_dictionary[x]) for x in sorted(DEFAULT_CHROMS)]

        output_bw.addHeader(header)

        # Create a status bar to look fancy and count which chromosome you are on
        chrom_status_bar = tqdm.tqdm(total=len(DEFAULT_CHROMS), desc='Chromosomes Processed', position=0)

Modified

    with pyBigWig.open(args.signal) as input_bw, pyBigWig.open(OUTPUT_FILENAME, "w") as output_bw:
        header = [(x, chromosome_length_dictionary[x]) for x in sorted(args.chroms)]

        output_bw.addHeader(header)

        # Create a status bar to look fancy and count which chromosome you are on
        chrom_status_bar = tqdm.tqdm(total=len(args.chroms), desc='Chromosomes Processed', position=0)
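For reference, the added parser option might look something like the following (a hypothetical sketch; the exact flag name and default in the repo may differ, and DEFAULT_CHROMS is the constant from the original snippet):

    # Hypothetical argparse addition for user-defined chromosomes
    parser.add_argument("--chroms",
                        nargs="+",
                        default=DEFAULT_CHROMS,
                        help="Chromosomes to include in the output bigwig header")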

tacazares commented 3 years ago

The raw counts data outperformed the log-normalized quant data on every metric. There may need to be some adjustments, but we are going to use RP20M_minmax01 counts for quant data from now on.
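A minimal sketch of what RP20M_minmax01 plausibly denotes, assuming RP20M means reads per 20 million mapped reads and minmax01 means min-max scaling into [0, 1] (an interpretation, not confirmed in this thread):

    import numpy as np

    def rp20m_minmax01(counts, total_mapped_reads):
        # Depth normalization: scale counts to a 20-million-read library
        rp20m = counts * (20_000_000 / total_mapped_reads)
        # Min-max scale the normalized signal into [0, 1]
        lo, hi = rp20m.min(), rp20m.max()
        return (rp20m - lo) / (hi - lo)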