Genentech / gReLU

gReLU is a python library to train, interpret, and apply deep learning models to DNA sequences.
https://genentech.github.io/gReLU/
MIT License

Questions about reading variants #45

Open HelloWorldLTY opened 2 months ago

HelloWorldLTY commented 2 months ago

Hi, thanks for your great work. Do you currently support loading variants from VCF files and filtering them based on VAF, DP, DQ, etc.? Thanks a lot.

HelloWorldLTY commented 2 months ago

Also, I wonder if it is possible to read variants that are insertions rather than substitutions. It seems that the current design cannot handle alternate alleles whose lengths differ:

File ~/.conda/envs/evo/lib/python3.11/site-packages/grelu/data/dataset.py:599, in VariantDataset._load_alleles(self, variants)
    597 def _load_alleles(self, variants: pd.DataFrame) -> None:
    598     self.ref = strings_to_indices(variants.ref.tolist())
--> 599     self.alt = strings_to_indices(variants.alt.tolist())

File ~/.conda/envs/evo/lib/python3.11/site-packages/grelu/sequence/format.py:251, in strings_to_indices(strings, add_batch_axis)
    247         return arr
    249 # Convert multiple sequences; they must all have equal length
    250 else:
--> 251     assert check_equal_lengths(
    252         strings
    253     ), "All input sequences must have the same length."
    254     return np.stack(
    255         [[BASE_TO_INDEX_HASH[base] for base in string] for string in strings]
    256     ).astype(np.int8)

AssertionError: All input sequences must have the same length.

Thanks a lot.
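For context, the assertion in the traceback fires because the allele strings are stacked into a single array, which requires every string to have the same length. A minimal pre-check along these lines can show which rows would trigger it before the dataset is built. This is only a sketch: `partition_variants` and the dict-based variant records are hypothetical and not part of gReLU, and only the `ref` and `alt` field names come from the traceback.

```python
def partition_variants(variants):
    """Split variants into single-base substitutions and everything else.

    `variants` is an iterable of dicts with at least "ref" and "alt" keys
    (a hypothetical stand-in for the rows of a variant table).
    """
    snvs, others = [], []
    for v in variants:
        if len(v["ref"]) == 1 and len(v["alt"]) == 1:
            snvs.append(v)
        else:
            # Insertions, deletions, and multi-base alleles land here.
            others.append(v)
    return snvs, others


variants = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "chr1", "pos": 200, "ref": "C", "alt": "CTT"},  # insertion
    {"chrom": "chr2", "pos": 300, "ref": "TAG", "alt": "T"},  # deletion
]
snvs, others = partition_variants(variants)
print(len(snvs), "single-base substitutions;", len(others), "other variants")
```

Only the first group has uniform allele lengths, so only those rows would pass the equal-length check shown above.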

avantikalal commented 2 months ago

Hi @HelloWorldLTY, thanks for raising these points. We do not currently support VCF reading or indels, but we are working on indel support and hope to add it soon.
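In the meantime, VCF records can be parsed and filtered outside gReLU. The following is a rough sketch using only the Python standard library, not a gReLU feature. It assumes that read depth is stored as `DP` in the INFO column (this depends on the variant caller), that filtering on the QUAL column is acceptable, and that the output keys `chrom` and `pos` are suitable names. The function name `read_vcf_variants` and its parameters are hypothetical. VAF is not a standard VCF field, so filtering on it would need caller-specific handling that is not shown here.

```python
import io


def read_vcf_variants(handle, min_dp=0, min_qual=0.0):
    """Yield one dict per ALT allele for records passing simple filters.

    Assumes DP lives in the INFO column and that multi-allelic records
    list their ALT alleles separated by commas.
    """
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alts, qual, _filter, info = fields[:8]
        # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
        info_map = dict(
            item.split("=", 1) if "=" in item else (item, True)
            for item in info.split(";")
        )
        dp = int(info_map.get("DP", 0))
        q = 0.0 if qual == "." else float(qual)
        if dp < min_dp or q < min_qual:
            continue
        for alt in alts.split(","):
            yield {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30\n"
    "chr1\t200\t.\tC\tCTT\t10\tPASS\tDP=5\n"
    "chr2\t300\t.\tT\tA,C\t60\tPASS\tDP=40\n"
)
variants = list(read_vcf_variants(example, min_dp=10, min_qual=20))
```

Here the second record is dropped for low depth and quality, and the multi-allelic third record is expanded into one entry per alternate allele.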

HelloWorldLTY commented 2 months ago

Thanks. My best workaround for now is to iterate over the sequences, assigning a separate variant-calling object (vr) to each one and mapping multiple insertions. Built-in functions for this would be very helpful.