ML4GLand / EUGENe

Elucidating the Utility of Genomic Elements with Neural Nets
MIT License

pp.pad_seqs_sdata error #44

Open treblegaia opened 11 months ago

treblegaia commented 11 months ago

pp.pad_seqs_sdata(sdata, length=1000, seq_var="hg38")

I ran the line above and got this error:

---> [97] padded_seqs = sp.pad_seqs(seqs=sdata["seq"].values, pad=pad, pad_value=pad_value, length=length)
     [98] sdata[f"{seq_var}_padded"] = xr.DataArray(padded_seqs, dims=["_sequence", "length"])

--> [185] raise KeyError(key)

KeyError: 'seq'

https://github.com/ML4GLand/EUGENe/blob/13db749d9a639d8baf0a92f536b6dcca02e9c838/eugene/preprocess/_seqdata.py#L97C101-L97C101

Does this mean the seq_var of the sdata has to be "seq"?

adamklie commented 11 months ago

Thanks for raising this issue!

Yes, pad_seqs_sdata needs to be updated to actually use seq_var; right now it reads sdata["seq"] regardless of what you pass. For now you can run:

import seqpro as sp
import xarray as xr
padded_seqs = sp.pad_seqs(seqs=sdata["hg38"].values, length=1000)
sdata["hg38_padded"] = xr.DataArray(padded_seqs, dims=["_sequence", "length"])

or you can change the variable name to "seq" using:

sdata = sdata.rename_vars({"hg38": "seq"})
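
For illustration only, here is a minimal sketch of the kind of change described above: looking the sequences up by seq_var instead of a hard-coded "seq". It is not EUGENe's actual code. The helper name pad_seqs_by_var is hypothetical, the padding is a pure-Python stand-in for sp.pad_seqs (right-padding with "N" and truncating longer sequences are assumptions of this sketch), and a plain dict stands in for sdata so the snippet runs without extra packages.

```python
def pad_seqs_by_var(sdata, seq_var, length, pad_value="N"):
    """Hypothetical helper: pad the sequences stored under sdata[seq_var].

    The padding here is a stand-in for sp.pad_seqs, not its real behavior.
    Assumes sequences are Python strings.
    """
    if seq_var not in sdata:
        raise KeyError(f"{seq_var!r} not found in sdata")
    column = sdata[seq_var]
    # An xarray DataArray exposes its data via .values; a plain list does not.
    seqs = getattr(column, "values", column)
    # Right-pad (or truncate) each sequence to exactly `length` characters.
    return [list(str(s)[:length].ljust(length, pad_value)) for s in seqs]


# Toy stand-in for sdata: a dict keyed by variable name (an assumption).
toy = {"hg38": ["ACGT", "AC"]}
padded = pad_seqs_by_var(toy, "hg38", length=6)
print(["".join(row) for row in padded])  # ['ACGTNN', 'ACNNNN']
```

The point is only that the lookup key comes from the argument, so passing seq_var="hg38" no longer fails with KeyError: 'seq'.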