TravisWheelerLab / Mirage

Mirage generates multiple sequence alignments of proteoforms by mapping sequences back to their coding regions on the genome and then compressing those mappings into alignments.
BSD 3-Clause "New" or "Revised" License
9 stars 4 forks source link

Outputs #50

Closed aylinsonlu closed 2 years ago

aylinsonlu commented 2 years ago

Hello Mirage Team, I would like to use your tool. I've downloaded genomes and gtf files by using your Download-Genomes.sh and Download-GTFs.sh scripts. I tried the tool for obscurin proteins with "./mirage2 obscn.fa species-guide.txt" command, however the Mirage-results/Final-MSAs directory contains only protein fasta files (unaligned). In each fasta file original protein sequence is stored. Species-MSAs/Misc/alignments directory contains same fasta files as well. Mappings directory is empty. Do I run the tool wrong? I appreciate your help. Thank you

alexander-nord commented 2 years ago

Hi Aylin,

Thanks for reaching out! It sounds like you ran the tool correctly, so I'm wondering if there might be a mismatch between the species names in "obscn.fa" and "species-guide.txt" Would you mind sending along (either as message text or a file) two things: the results of running (from the mirage directory) "./build/hsi/build/sstat obscn.fa", and the contents of the species guide (just "cat species-guide.txt" would be perfect!).

Best regards,

Alex

aylinsonlu commented 2 years ago

Hello,

Thank you very much! Here are the results of running:

the result of "./build/hsi/build/sstat obscn.fa": Number of Sequences : 8 Sequence Alphabet(s): Protein Total Number of Residues: 44914 Shortest Sequence Length: 345 Longest Sequence Length: 8886 : sp|A2AAJ9-2|OBSCN_MOUSE 345 : sp|A2AAJ9-3|OBSCN_MOUSE 732 : sp|A2AAJ9|OBSCN_MOUSE 8886 : sp|Q5VST9-2|OBSCN_HUMAN 7969 : sp|Q5VST9-3|OBSCN_HUMAN 6620 : sp|Q5VST9-5|OBSCN_HUMAN 3911 : sp|Q5VST9-6|OBSCN_HUMAN 8483 : sp|Q5VST9|OBSCN_HUMAN 7968 The result of "cat species-guide.txt": (Human,Mouse) Human /cta/users/abircan/Mirage/Genomes/Human.fa /cta/users/abircan/Mirage/GTFs/Human.gtf Mouse /cta/users/abircan/Mirage/Genomes/Mouse.fa /cta/users/abircan/Mirage/GTFs/Mouse.gtf

alexander-nord commented 2 years ago

Great!

The good news (for me) is that I don't have to debug Mirage to fix your problem. The not-as-good-news (for you) is that you'll need to fiddle with your protein sequence names. It looks like you're somewhat in-between the Mirage format and the UniProt format, so what Mirage is seeing is that all of these sequences belong to the species "sp" and their gene families are, taking the first two sequences as examples, "A2AAJ9-2" and "A2AAJ9-3" (and so on).

There are two easy fixes that I can recommend: First, to convert them into the Mirage format, you can flip the fields around so that (again, taking the first two sequences as examples) they would be "MOUSE|OBSCN|A2AAJ9-2" and "MOUSE|OBSCN|A2AAJ9-3" I think this regex get you where you need to go:

$seqname =~ /\|([^\|]+)\|([A-Z]+)\_([A-Z]+)$/;
$seqname = $3.'|'.$2.'|'.$1;

The second fix would be to add the 'OS=' and 'GN=' fields so that Mirage would treat the names as UniProt-formatted. This would look like " sp|A2AAJ9-2|OBSCN_MOUSE OS=MOUSE GN=OBSCN" and "sp|A2AAJ9-3|OBSCN_MOUSE OS=MOUSE GN=OBSCN" To pile on with the perl-y regex:

$seqname =~ /\|([A-Z]+)\_([A-Z]+)$/;
$seqname = $seqname.' OS='.$2.' GN='.$1;

The last thing you'll want to do is delete the file "obscn.fa.hsi" after you change the sequence names, so that a new index can be built on the updated names.

Hopefully that's helpful! I had a suspicion when I was writing the README that I might have weighted the value of brevity over clarity in the sequence name requirements, so it's been good for me to get this nudge to improve that portion of the README.

Let me know if any of this is unclear, or if something else doesn't go right with your tests!

aylinsonlu commented 2 years ago

Hello,

Thank you for your kind and detailed explanation! I appreciate it! My fasta headers were default Uniprot ids like "sp|Q5VST9|OBSCN_HUMAN Obscurin OS=Homo sapiens OX=9606 GN=OBSCN PE=1 SV=3" . I changed them as ">human|obscn|isoform_1 # obscurin, primary isoform, Swiss Prot accession Q5VST9", and it did work! I will check headers. Thank you!

alexander-nord commented 2 years ago

Glad that worked! Happy aligning!