FRED-2 / OptiType

Precision HLA typing from next-generation sequencing data
BSD 3-Clause "New" or "Revised" License

Running OptiType for non-human samples #97

Closed drtamermansour closed 4 months ago

drtamermansour commented 5 years ago

Is there a way to use our own set of MHC alleles?

b-schubert commented 4 years ago

Yes, there is. But it's a bit of work.

You have to provide your own reference FASTA file and overwrite the alleles HDF5 file, which contains meta-information about the MHC alleles in the reference.

See the relevant files in the data folder for reference.

drtamermansour commented 4 years ago

Thanks for the response. How may I generate the alleles.h5 file? Is there a script somewhere that I may use to convert the alleles from FASTA format to HDF5?

b-schubert commented 4 years ago

The HDF5 file stores multiple data frames created from the IMGT .dat file and the corresponding sequence files.

In hlatyper.py there is a function called create_allele_dataframes that consumes an IMGT .dat file and two FASTA files containing the HLA sequences in DNA and RNA. You can then store the data frames in an HDF5 file with the function store_dataframes, contained in the same file. Please check out the functions, adjust them accordingly, and make sure to name the data frames and columns identically; otherwise, you will run into problems when running the pipeline with your files.

zlskidmore commented 4 years ago

I am also trying to mirror these file structures for canine. I was referencing the files in /data and noticed that hla_reference_rna.fasta does not start with "ATG". It looks like the starting "GCTCCCACT" motif in hla_reference_rna.fasta is the start of exon 2 for HLA00001, according to the .dat file from IMGT. Is this by design?

I'm not sure of the relationship between the FASTA files and the .dat/.h5 files within the OptiType codebase, so I'm wondering whether not starting at exon 2 in the RNA FASTA for canine might mess things up.
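
A quick way to check an observation like this is to look up where a reference record begins inside the full coding sequence of the same allele. The following is a minimal sketch, not part of OptiType: read_fasta and reference_offset are hypothetical helpers, and the sequences are toy data invented for illustration (only the "GCTCCCACT" motif is taken from the question above).

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of record ID -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(chunks) for key, chunks in records.items()}


def reference_offset(full_cds, reference_seq, probe_len=9):
    """Return the 0-based index in full_cds where reference_seq begins.

    Only the first probe_len bases are matched; returns -1 if not found.
    """
    return full_cds.find(reference_seq[:probe_len])


# Toy sequences, invented for illustration (not real HLA or canine data).
toy_fasta = ">toy_allele full CDS\nATGGCCGTCATG\nGCTCCCACTCCATGAGG\n"
toy_cds = read_fasta(toy_fasta)["toy_allele"]
toy_ref = "GCTCCCACTCCATGAGG"

starts_with_atg = toy_ref.startswith("ATG")
offset = reference_offset(toy_cds, toy_ref)
print(starts_with_atg, offset)
```

A non-zero offset together with a missing start codon suggests the reference record was cut at an internal exon boundary rather than at the start of the coding sequence.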

RysBen commented 1 year ago

@b-schubert Could you provide more detailed construction rules, especially for intron sequences?

b-schubert commented 4 months ago

This is by design: OptiType uses only exons 2 and 3 because of the data available at that time. This might be different for your organism.
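
For anyone building an equivalent reference for another organism, restricting sequences to selected exons could look roughly like the sketch below. It assumes exon coordinates are 1-based and inclusive, as in EMBL-style feature tables; extract_exons is a hypothetical helper rather than an OptiType function, and the gene and coordinates are toy values.

```python
def extract_exons(sequence, exon_coords, wanted=(2, 3)):
    """Concatenate the requested exons of a genomic sequence.

    exon_coords maps exon number -> (start, end), 1-based inclusive
    (assumed convention, as in EMBL-style feature tables).
    """
    parts = []
    for number in wanted:
        start, end = exon_coords[number]
        # Convert 1-based inclusive coordinates to a Python slice.
        parts.append(sequence[start - 1:end])
    return "".join(parts)


# Toy gene, invented for illustration: exons separated by short "introns".
toy_gene = "ATGA" + "TT" + "CCCCCC" + "TT" + "GGGGG" + "TTT" + "ACGT"
toy_coords = {1: (1, 4), 2: (7, 12), 3: (15, 19), 4: (23, 26)}

exon_2_3 = extract_exons(toy_gene, toy_coords)
print(exon_2_3)
```

Keeping the wanted exons as a parameter makes it easy to try a different exon selection if more sequence data is available for the organism in question.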

Re introns: we imputed missing intronic information from the nearest-neighbour HLA allele that has intronic information. See the paper for more details.
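
The general idea of nearest-neighbour imputation can be sketched as follows. This is only an illustration under stated assumptions: it measures similarity as Hamming distance over pre-aligned exon sequences, which may differ from the distance and procedure used in the paper. The function names and all allele data are hypothetical.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def impute_introns(exons, introns):
    """Fill missing intron sequences from the closest allele by exon distance.

    exons:   allele -> aligned exon sequence
    introns: allele -> intron sequence, or None when unknown
    Returns allele -> (intron sequence, donor allele).
    """
    # Sorting makes tie-breaking deterministic (first donor in name order).
    donors = sorted(a for a, seq in introns.items() if seq is not None)
    if not donors:
        raise ValueError("no allele with known intron sequence")
    result = {}
    for allele, seq in introns.items():
        if seq is not None:
            result[allele] = (seq, allele)
            continue
        nearest = min(donors, key=lambda d: hamming(exons[allele], exons[d]))
        result[allele] = (introns[nearest], nearest)
    return result


# Toy data, invented for illustration (not real allele sequences).
toy_exons = {
    "allele_a": "ACGTACGT",
    "allele_b": "ACGTTCGT",
    "allele_c": "TTGTACCA",
    "allele_d": "ACGTTCGA",  # intron unknown, to be imputed
}
toy_introns = {
    "allele_a": "GTAAGA",
    "allele_b": "GTAAGT",
    "allele_c": "GTCCGT",
    "allele_d": None,
}

imputed = impute_introns(toy_exons, toy_introns)
print(imputed["allele_d"])
```

Returning the donor allele alongside the sequence keeps a record of which entries were imputed and from where, which helps when auditing a custom reference.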