Closed drtamermansour closed 4 months ago
Yes, there is, but it's a bit of work.
You have to provide your own reference FASTA file and overwrite the alleles HDF file, which holds the meta information for the MHC alleles contained in that reference.
See the relevant files in the data folder for reference.
Thanks for the response. How may I generate the alleles.h5 file? Is there a script somewhere that I can use to convert the alleles from FASTA format to HDF?
The HDF5 file stores multiple data frames created from the IMGT .dat file and the corresponding sequence files.
In hlatyper.py there is a function called create_allele_dataframes that consumes an IMGT .dat file and two FASTA files containing the HLA sequences in DNA and RNA. You can then store the data frames in an HDF5 file with the function store_dataframes, which is in the same file. Please check out both functions, adjust them as needed, and make sure the data frames and columns have exactly the same names as in the original file; otherwise, you will run into problems when running the pipeline with your files.
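The naming requirement above can be checked before running the pipeline. The following is only a minimal sketch: the table and column names are made up for illustration, and the helper schema_mismatches is hypothetical and not part of OptiType. The real names have to be read from the alleles file shipped in the data folder.

```python
def schema_mismatches(reference, custom):
    """Compare two {table_name: [column, ...]} mappings and list the differences."""
    problems = []
    for name in sorted(set(reference) | set(custom)):
        if name not in custom:
            problems.append(f"missing table: {name}")
        elif name not in reference:
            problems.append(f"unexpected table: {name}")
        elif list(reference[name]) != list(custom[name]):
            problems.append(
                f"column mismatch in {name}: "
                f"expected {list(reference[name])}, got {list(custom[name])}"
            )
    return problems


# Hypothetical schemas, for illustration only. Assuming pandas and PyTables are
# installed and every stored object is a DataFrame, a real schema could be read
# roughly like this:
#   with pandas.HDFStore(path, mode="r") as store:
#       schema = {key: list(store[key].columns) for key in store.keys()}
reference = {"table_a": ["id", "type", "length"], "table_b": ["id", "start", "end"]}
custom = {"table_a": ["id", "type", "len"]}

problems = schema_mismatches(reference, custom)
for problem in problems:
    print(problem)
```

An empty list means the custom file mirrors the reference layout, at least as far as table and column names go; it says nothing about the contents.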
I am also trying to mirror these file structures for canine. I was referencing the files in /data and noticed that hla_reference_rna.fasta does not start with "ATG". It looks like the starting "GCTCCCACT" motif in hla_reference_rna.fasta is the start of exon 2 for HLA00001, according to the .dat file from IMGT. Is this by design?
I'm not sure how the FASTA files relate to the .dat/.h5 files within the OptiType codebase, so I'm wondering whether not starting at exon 2 in the RNA FASTA for canine might mess things up.
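For anyone wanting to repeat this kind of check on their own reference, here is a rough sketch that reports the first bases of each FASTA record and whether the record begins with ATG. The records are made up (only the GCTCCCACT motif is taken from the comment above), and read_fasta and report_starts are hypothetical helpers, not OptiType functions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def report_starts(lines, prefix_len=9):
    """Map each record id to (leading bases, whether the record starts with ATG)."""
    return {
        header.split()[0]: (seq[:prefix_len], seq.startswith("ATG"))
        for header, seq in read_fasta(lines)
    }


# Made-up records for illustration; with a real file, pass an open file handle.
example = """>ALLELE_A made-up record
GCTCCCACTCC
ATGA
>ALLELE_B made-up record
ATGGCCGTC
""".splitlines()

starts = report_starts(example)
for allele, (prefix, has_atg) in starts.items():
    print(allele, prefix, "starts with ATG" if has_atg else "no leading ATG")
```

A record without a leading ATG is not necessarily wrong; it may simply begin at a later exon, which is what the question above is about.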
@b-schubert Could you provide more detailed construction rules, especially for intron sequences?
This is by design: OptiType uses only exons 2 and 3 because of the data available at that time. It might be different for your organism.
Re introns: we imputed missing intronic information from the nearest-neighbour HLA alleles that do have intronic information. See the paper for more details.
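To give a feel for what nearest-neighbour imputation can look like, here is a toy sketch. It is not the procedure from the paper: it assumes exon sequences are already aligned to equal length, uses a plain Hamming distance to pick the donor allele, and works on made-up allele names and sequences. The helpers hamming and impute_introns are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def impute_introns(exons, introns):
    """Fill in missing introns from the allele with the closest exon sequence.

    exons:   {allele: aligned exon sequence} for every allele
    introns: {allele: intron sequence} for the alleles where it is known
    """
    donors = [name for name in exons if introns.get(name)]
    filled = {}
    for name, exon in exons.items():
        if introns.get(name):
            filled[name] = introns[name]
        else:
            # Assumption: the allele with the fewest exon mismatches is the donor.
            nearest = min(donors, key=lambda d: hamming(exon, exons[d]))
            filled[name] = introns[nearest]
    return filled


# Made-up toy data: A2 has no known intron and is closest to A1 in its exon.
exons = {"A1": "ACGTAC", "A2": "ACGTAA", "B1": "TTGTCC"}
introns = {"A1": "gtaagt", "B1": "gtgagc"}

filled = impute_introns(exons, introns)
print(filled["A2"])
```

For another organism, the choice of distance and of which regions to compare would have to follow whatever reference data is actually available.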
Is there a way to use our own set of MHC alleles?