biotite-dev / biotite

A comprehensive library for computational molecular biology
https://www.biotite-python.org
BSD 3-Clause "New" or "Revised" License
581 stars 92 forks source link

Add chain id specifier to pdbx.get_sequence #599

Closed tjmier closed 2 weeks ago

tjmier commented 2 weeks ago

I have often been needing to retrieve the full sequence of a specific chain using pdbx.get_sequence. Since the entity and chain_ID are often not equivalent for the structures I work with its a bit of hassle to convert it.

Here is the change I recommend to get_sequence to implement a chain_id specifier.

`

def get_sequence(pdbx_file, data_block=None, chain_id=None):

    Get the protein and nucleotide sequences from the
    ``entity_poly.pdbx_seq_one_letter_code_can`` entry.

    Supported polymer types (``_entity_poly.type``) are:
    ``'polypeptide(D)'``, ``'polypeptide(L)'``,
    ``'polydeoxyribonucleotide'``, ``'polyribonucleotide'`` and
    ``'polydeoxyribonucleotide/polyribonucleotide hybrid'``.
    Uracil is converted to Thymine.

    Parameters
    ----------
    pdbx_file : CIFFile or CIFBlock or BinaryCIFFile or BinaryCIFBlock
        The file object.
    data_block : str, optional
        The name of the data block.
        Default is the first (and most times only) data block of the
        file.
        If the data block object is passed directly to `pdbx_file`,
        this parameter is ignored.
    chain_id : str, optional
        The chain ID of the sequence to return.
        If not given, all sequences are returned.
        Default is None.

    Returns
    -------
    sequences : list of Sequence
        The protein and nucleotide sequences for each entity
        or for the specified chain.
    """
    block = _get_block(pdbx_file, data_block)

    poly_category= block["entity_poly"]
    seq_string = poly_category["pdbx_seq_one_letter_code_can"].as_array(str)
    seq_type = poly_category["type"].as_array(str)
    sequences = []
    for string, stype in zip(seq_string, seq_type):
        sequence = _convert_string_to_sequence(string, stype)
        if sequence is not None:
            sequences.append(sequence)

    if chain_id is not None:
        entity_id = poly_category['entity_id'].as_array(int)

        strand_ids = poly_category['pdbx_strand_id'].as_array(str)

        target_entity_id = None
        for i, ids in enumerate(entity_id):
            if chain_id in strand_ids[i]:
                target_entity_id = ids

        if target_entity_id is None:
            raise ValueError(f"Chain ID {chain_id} not found.")

        return sequences[target_entity_id-1]

    else:
        return sequences

`

padix-key commented 2 weeks ago

Sounds like a good idea from my side. Could you create a PR? @t0mdavid-m As you are the original author, what is your opinion?

t0mdavid-m commented 2 weeks ago

What do you think about returning a Dictionary mappingchain_id->Sequence instead of a List?

tjmier commented 2 weeks ago

I think returning a Dictionary mapping chain_id->Sequence is a good idea. I submitted a PR implementing this.