CshlSiepelLab / phast

Problems extracting 4d sites with msa_view from MAF #34

Closed SimonaSecomandi closed 3 years ago

SimonaSecomandi commented 3 years ago

Good morning,

I'm trying to extract 4d sites from single-chromosome MAFs with msa_view, for use with phyloFit. I have one MAF file per reference chromosome and one GFF file with CDS features per chromosome. In the MAFs, each chromosome name is prefixed with the species name (e.g. "Hirundo_rustica.Chr1"), while the GFF has only the chromosome name ("Chr1").

MAF example:

    a
    s Hirundo_rustica.Chr1 4005 20 + 156035725 CTCGGAGGTCTTCTTCTGCG
    s Camarhynchus_parvulus.NC_044572.1 1061224 17 - 151975198 CTGTCAGATCTT---CCGGG
    s Gallus_gallus.NC_006089.5 565360 17 - 149682049 cccccatcccgt---gtcct
    s Molothrus_ater.NC_050479.1 136502195 17 - 148165238 CTGTCAGATGGT---CCAGG
    s Motacilla_alba.NC_052017.1 1187072 17 - 152195414 ATGTGAGATCTT---CTACA
    s Taeniopygia_guttata.NC_045000.1 196055 17 - 151896526 ATGTCAGATTCT---CCTCA

When I ran msa_view Chr1.maf --in-format MAF --4d --features Chr1_CDS.gff > Chr1_codons.ss on the files described above, the program failed to recognize the alignment.

I then added the species prefix to the chromosome names in all the GFF files as well, and it worked.
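The workaround described above (adding the species prefix to the chromosome names in the GFF) could be scripted roughly as follows. This is a minimal sketch and not part of PHAST: the function name `prefix_gff_seqnames` and the example CDS line are hypothetical, and it assumes a tab-separated GFF whose first column holds the bare chromosome name.

```python
import io


def prefix_gff_seqnames(lines, species):
    """Yield GFF lines with '<species>.' prepended to the first column."""
    for line in lines:
        # Leave comments, directives, and blank lines untouched.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.split("\t")
        # Skip lines that already carry the prefix, so reruns are harmless.
        if not fields[0].startswith(species + "."):
            fields[0] = f"{species}.{fields[0]}"
        yield "\t".join(fields)


# Hypothetical one-line GFF, used only to illustrate the renaming.
example = "Chr1\tsource\tCDS\t4006\t4025\t.\t+\t0\tID=cds1\n"
prefixed = list(prefix_gff_seqnames(io.StringIO(example), "Hirundo_rustica"))
print(prefixed[0], end="")
```

For a real file, the same generator can be fed an open file handle and its output written to a new GFF, leaving the original untouched.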

Then I ran msa_view Chr1_codons.ss --in-format SS --out-format SS --tuple-size 1 > Chr1_sites.ss for each chromosome.

The next step would be to aggregate all the chromosomes into a single SS file, like so: msa_view --aggregate Hirundo_rustica,Gallus_gallus,Ficedula_albicollis,Parus_major,Camarhynchus_parvulus,Molothrus_ater,Lonchura_striata,Taeniopygia_guttata *.sites.ss > all-4d.sites.ss

However, the program failed with "Unable to determine alignment format". I believe the errors are related to the sequence names in the MAF. Indeed, I have some concerns about the codon.ss and sites.ss files: it seems that msa_view garbled the species names. Here's the codon.ss file header for my Chr1:

    NSEQS = 17
    LENGTH = 284354
    TUPLE_SIZE = 3
    NTUPLES = 9867
    NAMES = Hirundo_rustica,Anc2:0,(Camarhynchus_parvulus:38,((Molothrus_ater:24,Anc6:10,Anc7:24,Anc5:3,Anc3:5,Anc1:54,Anc0;,Gallus_gallus:98,Camarhynchus_parvulus,Gallus_gallus,Molothrus_ater,Motacilla_alba,Taeniopygia_guttata,Lonchura_striata,Ficedula_albicollis,Parus_major
    ALPHABET = ACGT
    NCATS = -1

The NAMES field is garbled: it contains the name of my reference species (Hirundo_rustica), an ancestral species (Anc2), and then a truncated copy of the phylogenetic tree present in the MAFs.
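A quick way to spot this kind of corruption across many per-chromosome SS files is to flag any NAMES entry containing Newick punctuation. The sketch below is only illustrative: `suspicious_names` is a hypothetical helper, not a PHAST function, and it assumes the header stores NAMES on a single line as a comma-separated list, as in the header quoted above.

```python
import re

# Characters that belong to a Newick tree, not to a plain species name.
NEWICK_CHARS = re.compile(r"[():;]")


def suspicious_names(header_text):
    """Return the NAMES entries that contain Newick punctuation."""
    match = re.search(r"^NAMES\s*=\s*(.*)$", header_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no NAMES line found in header")
    names = [name.strip() for name in match.group(1).split(",")]
    return [name for name in names if NEWICK_CHARS.search(name)]


# Shortened header built from entries quoted in the post above.
header = (
    "NSEQS = 4\n"
    "NAMES = Hirundo_rustica,Anc2:0,(Camarhynchus_parvulus:38,Gallus_gallus\n"
    "ALPHABET = ACGT\n"
)
print(suspicious_names(header))
```

An empty result means every entry looks like a plain name; it does not by itself prove that the names match those in the GFF.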

Do you have any suggestions on how to handle the species prefix when using MAFs with msa_view?

Many thanks.

ramaniritika commented 3 years ago

Hello,

It seems the problem is with the species naming. Please make sure that the species names are exactly the same in both the MAF and the GFF. Also, make sure that the MAF is sorted by reference-genome coordinate. I am not sure exactly why you have this issue, but try shortening the species names to see whether they are causing the problem. If none of this works, please send small example input files so that we can replicate the problem on our end.
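Both checks (which species names the MAF actually uses, and whether its blocks are sorted by reference coordinate) can be done with a short script before running msa_view. The following is a rough sketch under stated assumptions, not a PHAST utility: `check_maf` is a hypothetical name, sequence names are assumed to have the form species.chromosome, and the reference is assumed to appear once per block on the + strand.

```python
def check_maf(lines, ref_species):
    """Return (species_names, is_sorted) for MAF text given as lines."""
    species = set()
    ref_starts = []
    for line in lines:
        fields = line.split()
        # Sequence lines look like: s src start size strand srcSize text
        if len(fields) < 7 or fields[0] != "s":
            continue
        # Everything before the first '.' is taken as the species name.
        name = fields[1].partition(".")[0]
        species.add(name)
        if name == ref_species:
            ref_starts.append(int(fields[2]))
    # Sorted means reference start coordinates never decrease.
    is_sorted = all(a <= b for a, b in zip(ref_starts, ref_starts[1:]))
    return species, is_sorted


# Two sequence lines taken from the MAF example earlier in the thread.
maf_lines = [
    "a",
    "s Hirundo_rustica.Chr1 4005 20 + 156035725 CTCGGAGGTCTTCTTCTGCG",
    "s Gallus_gallus.NC_006089.5 565360 17 - 149682049 cccccatcccgt---gtcct",
]
species, is_sorted = check_maf(maf_lines, "Hirundo_rustica")
print(sorted(species), is_sorted)
```

The set of species names can then be compared by eye with the sequence names in the GFF and with the list passed to --aggregate.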

Ritika

bighawkin commented 5 months ago

Have you solved this problem? Could you tell me how you dealt with it? @SimonaSecomandi