3DGenomes / TADbit

TADbit is a complete Python library to deal with all steps to analyze, model and explore 3C-based data. With TADbit the user can map FASTQ files to obtain raw interaction binned matrices (Hi-C like matrices), normalize and correct interaction matrices, identify and compare the so-called Topologically Associating Domains (TADs), build 3D models from the interaction matrices, and finally, extract structural properties from the models. TADbit is complemented by TADkit for visualizing 3D models
GNU General Public License v3.0
100 stars 61 forks source link

Adding support for Spacewalk files in tadbit model #393

Open iagomaceda opened 1 month ago

iagomaceda commented 1 month ago

Hi,

It would be nice to add support for the Spacewalk binary files as an output for 'tadbit model'.

I already coded the function needed using the ensembl of models as a base:

def write_spacewalk(model_ensemble : StructuralModels, sw_filename : str):
    """
    This function writes a spacewalk file from a model ensemble.

    Parameters
    ----------
    model_ensemble : StructuralModels
        A model ensemble object.
    sw_filename : str
        The name of the spacewalk file to be written.
    """

    region_chr = model_ensemble.description["chromosome"]
    region_start = model_ensemble.description["start"]
    region_reso = model_ensemble.description["resolution"]

    metadata = {"format" : "sw1",
                'version' : '1.0.0', 
                'author' : 'tadbit',
                'date' : str(datetime.now()),
                "name" : f"{model_ensemble.description['project']}_{model_ensemble.description['identifier']}",
                "genome" : str(model_ensemble.description['assembly'])}

    with h5py.File(sw_filename, 'w') as swbf:

        header_group = swbf.create_group('Header')
        header_group.attrs.update(metadata)
        header_group.attrs['point_type'] = 'single_point'

        root = swbf.create_group(metadata['name'])
        genomic_position_group = root.create_group('genomic_position')
        spatial_position_group = root.create_group('spatial_position')

        region_list = [[region_chr, str(region_start + (i - 1) * region_reso), str(region_start + i * region_reso)] for i in range(0, model_ensemble.nloci + 1)]

        genomic_position_group.create_dataset('regions', data=region_list)

        live_contact_map_vertices = np.empty((0, 3), dtype=float)

        for model_index, model in enumerate(model_ensemble, start=0):

            xyz = np.array([model['x'], model['y'], model['z']], dtype=float).T
            live_contact_map_vertices = np.vstack((live_contact_map_vertices, xyz))

            dataset_name = f't_{model_index}'
            spatial_position_group.create_dataset(dataset_name, data=xyz)

        root.create_dataset('live_contact_map_vertices', data=live_contact_map_vertices)

If you could add it (and the argument to get it) it would be very nice. Feel free to change variable names to fit the rest of the code.

Thanks,

Iago

zozo123 commented 1 month ago

Great idea!

Here's the binary format specification: The Spacewalk Binary File Format