Closed SimonaSecomandi closed 3 years ago
Hello,
It seems that the problem is with the naming of the species. Please make sure that the species names are exactly the same in both the MAF and the GFF. Also, make sure that the MAF is sorted by reference genome coordinate. I am not sure exactly why you have this issue, but try shortening the species names to see if they are causing the problem. If none of these things work, please send small example input files so that we can replicate the problem on our end.
Ritika
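The two checks suggested above (matching sequence names, and a MAF sorted by reference coordinate) can be sketched in Python. This is only a minimal sketch and not part of PHAST: the function names are hypothetical, and it assumes that the first "s" line of each block is the reference, that the file covers a single reference chromosome, and that column 1 of the GFF holds the sequence name.

```python
def maf_reference_rows(maf_lines):
    """Yield (name, start) for the first 's' line of each alignment block."""
    expect_ref = False
    for line in maf_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            # An 'a' line opens a new block; the next 's' line is assumed
            # to be the reference sequence.
            expect_ref = True
        elif fields[0] == "s" and expect_ref:
            yield fields[1], int(fields[2])
            expect_ref = False


def gff_seqnames(gff_lines):
    """Return the set of sequence names found in column 1 of a GFF file."""
    names = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t")[0])
    return names


def check_inputs(maf_lines, gff_lines):
    """Return (names only in the MAF, names only in the GFF, sorted flag)."""
    rows = list(maf_reference_rows(maf_lines))
    maf_names = {name for name, _ in rows}
    gff_names = gff_seqnames(gff_lines)
    starts = [start for _, start in rows]
    is_sorted = all(a <= b for a, b in zip(starts, starts[1:]))
    return maf_names - gff_names, gff_names - maf_names, is_sorted
```

Both arguments can be open file handles, e.g. check_inputs(open("Chr1.maf"), open("Chr1_CDS.gff")). Any name reported in the first two sets is a candidate for the mismatch described above.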
Have you solved this problem? Could you tell me how to deal with it? @SimonaSecomandi
Good morning,
I'm trying to extract 4d sites from single-chromosome MAFs with msa_view to use with phyloFit. I have a MAF file for each reference chromosome and a GFF file with CDS features for each chromosome. In the MAFs, each chromosome name carries the species name as a prefix (e.g. "Hirundo_rustica.Chr1"), while the GFF has only the chromosome name ("Chr1").
MAF example:
a
s Hirundo_rustica.Chr1 4005 20 + 156035725 CTCGGAGGTCTTCTTCTGCG
s Camarhynchus_parvulus.NC_044572.1 1061224 17 - 151975198 CTGTCAGATCTT---CCGGG
s Gallus_gallus.NC_006089.5 565360 17 - 149682049 cccccatcccgt---gtcct
s Molothrus_ater.NC_050479.1 136502195 17 - 148165238 CTGTCAGATGGT---CCAGG
s Motacilla_alba.NC_052017.1 1187072 17 - 152195414 ATGTGAGATCTT---CTACA
s Taeniopygia_guttata.NC_045000.1 196055 17 - 151896526 ATGTCAGATTCT---CCTCA
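To make the naming convention in these "s" lines explicit, here is a small Python sketch that splits the source field into species and chromosome. It is only an illustration: the MafSeqLine type and parse_s_line function are hypothetical names, and it assumes the "species.chromosome" convention with no dots inside the species name.

```python
from typing import NamedTuple


class MafSeqLine(NamedTuple):
    species: str
    chrom: str
    start: int
    size: int
    strand: str
    src_size: int
    text: str


def parse_s_line(line):
    """Parse one MAF 's' line into its seven whitespace-separated fields."""
    tag, src, start, size, strand, src_size, text = line.split()
    if tag != "s":
        raise ValueError("not an 's' line: " + line)
    # Split at the first dot only: accessions such as NC_044572.1
    # contain a dot of their own.
    species, _, chrom = src.partition(".")
    return MafSeqLine(species, chrom, int(start), int(size),
                      strand, int(src_size), text)
```

Splitting at the first dot keeps versioned accessions such as NC_044572.1 intact as the chromosome part.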
Running
msa_view Chr1.maf --in-format MAF --4d --features Chr1_CDS.gff > Chr1_codons.ss
using the files described above, the program failed to recognize the alignment. I tried adding the species prefix to all the GFF files as well, and it worked.
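For reference, prepending the species name to the sequence name in column 1 of a GFF can be sketched as follows. This is a hedged illustration rather than the exact procedure I used: prefix_gff_seqnames is a hypothetical helper, and it assumes a tab-separated GFF with comment lines starting with "#".

```python
def prefix_gff_seqnames(gff_lines, prefix):
    """Yield GFF lines with `prefix` prepended to column 1.

    Comment lines and blank lines are passed through unchanged.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        seqname, sep, rest = line.partition("\t")
        if not seqname.startswith(prefix):  # avoid double-prefixing
            seqname = prefix + seqname
        yield seqname + sep + rest
```

A possible use, with a hypothetical output file name: open "Chr1_CDS.gff" for reading and "Chr1_CDS.prefixed.gff" for writing, then call writelines on the output with prefix_gff_seqnames(infile, "Hirundo_rustica."), so that "Chr1" becomes "Hirundo_rustica.Chr1" as in the MAF.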
Then I ran
msa_view Chr1_codons.ss --in-format SS --out-format SS --tuple-size 1 > Chr1_sites.ss
for each chromosome. The next step would be to aggregate all the chromosomes into a single SS file, like so:
msa_view --aggregate Hirundo_rustica,Gallus_gallus,Ficedula_albicollis,Parus_major,Camarhynchus_parvulus,Molothrus_ater,Lonchura_striata,Taeniopygia_guttata *.sites.ss > all-4d.sites.ss
However, the program failed with "Unable to determine alignment format". I believe that the errors are related to the sequence names in the MAF. Indeed, I have some concerns about the codons.ss and sites.ss files. It seems that msa_view messed up the species names. Here's the codons.ss file header for my Chr1:
NSEQS = 17
LENGTH = 284354
TUPLE_SIZE = 3
NTUPLES = 9867
NAMES = Hirundo_rustica,Anc2:0,(Camarhynchus_parvulus:38,((Molothrus_ater:24,Anc6:10,Anc7:24,Anc5:3,Anc3:5,Anc1:54,Anc0;,Gallus_gallus:98,Camarhynchus_parvulus,Gallus_gallus,Molothrus_ater,Motacilla_alba,Taeniopygia_guttata,Lonchura_striata,Ficedula_albicollis,Parus_major
ALPHABET = ACGT
NCATS = -1
The NAMES are messed up: there is the name of my reference species (Hirundo_rustica), an ancestral species (Anc2), and then the phylogenetic tree present in the MAFs, which is truncated.
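A quick way to spot this kind of damage in other SS files is to look for tree punctuation in the NAMES entry. The Python sketch below is only an illustration: both function names are hypothetical, and it assumes the header is laid out as one "KEY = value" pair per line and ends at the first line without an "=".

```python
def parse_ss_header(text):
    """Collect 'KEY = value' lines from the top of an SS file into a dict."""
    header = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            break  # first line without '=' is taken as the end of the header
        header[key.strip()] = value.strip()
    return header


def suspicious_names(header):
    """Return NAMES entries containing Newick punctuation (':', ';', parentheses)."""
    names = header.get("NAMES", "").split(",")
    return [name for name in names if any(c in name for c in ":;()")]
```

On the header above, entries such as "Anc2:0" and "Anc0;" would be reported, while plain species names such as "Parus_major" would not.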
Do you have any suggestions on how to handle the species prefix when using MAFs with msa_view?
Many thanks.