A note on this PR for future reference. Phi-2 from Microsoft uses a rotary dimension (32) that differs from the per-head dimension (2560/32 = 80), which makes things a little awkward: rotary embeddings are applied only to the first 32 dimensions of each head, and the remaining ones (33 to 80) are just copied through unchanged.
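To make the partial rotation concrete, here is a minimal pure-Python sketch for a single head vector. It is not the code from this PR: the function name `apply_partial_rotary`, the base of 10000, and the "rotate-half" pairing (dimension i paired with i + 16) are assumptions for illustration, and the actual implementation may pair dimensions differently.

```python
import math

def apply_partial_rotary(vec, position, rotary_dim=32, base=10000.0):
    """Rotate the first `rotary_dim` entries of one head vector; copy the rest.

    Assumes the rotate-half layout: entry i is paired with entry i + rotary_dim // 2.
    """
    half = rotary_dim // 2
    out = list(vec)  # entries at index >= rotary_dim stay as a plain copy
    for i in range(half):
        angle = position * base ** (-i / half)
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out

head_dim = 2560 // 32  # 80 dimensions per head, as in Phi-2
q = [0.01 * k for k in range(head_dim)]
q_rot = apply_partial_rotary(q, position=5)
```

In this sketch only `q_rot[:32]` changes with the position; `q_rot[32:]` is identical to `q[32:]`.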
NOTE 2: I am trying to build a generic converter, convert_HF.py, which for now is compatible with Llama, Mistral, and Phi. I hope to include other models and, in the end, have only a single converter.