baoxingsong / AnchorWave

Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism and whole-genome duplication variation
MIT License
145 stars 19 forks source link

running proali from custom gff files #45

Open aleksicmil opened 1 year ago

aleksicmil commented 1 year ago

Hi,

I am trying to run anchorwave as an alignment step of the buckler lab PHG pipeline.

proali step is failing, and seems to me like the problem is that refCDS.fa (and all other files generated in the preparatory steps) are empty. AssemblyMAFFromAnchorWavePlugin from PHG runs the following command:

anchorwave gff2seq -r /path/to/MAIZE_B73.fa -i /path/to/ensembl_corn_v36-gene-model.noscaffolds.gff -o /path/to/refCDS.fa

My GFF file looks like this (just first gene is showed):

1       gramene gene    44289   49837   0       +       .       ID=Zm00001d027230;Name=Zm00001d027230;
1       gramene mRNA    44289   49837   0       +       .       ID=Zm00001d027230_T001;Parent=Zm00001d027230;Name=Zm00001d027230_T001;biotype=protein_coding;Note=Mitochondrial transcription termination factor family protein;
1       gramene polypeptide     44351   47995   0       +       .       ID=Zm00001d027230_P001;Derives_from=Zm00001d027230_T001;Name=Zm00001d027230_P001;
1       gramene exon    44289   44947   0       +       .       Name=Zm00001d027230_T001.e1;Parent=Zm00001d027230_T001;
1       gramene exon    45666   45803   0       +       .       Name=Zm00001d027230_T001.e2;Parent=Zm00001d027230_T001;
1       gramene exon    45888   46133   0       +       .       Name=Zm00001d027230_T001.e3;Parent=Zm00001d027230_T001;
1       gramene exon    46229   46342   0       +       .       Name=Zm00001d027230_T001.e4;Parent=Zm00001d027230_T001;
1       gramene exon    46451   46633   0       +       .       Name=Zm00001d027230_T001.e5;Parent=Zm00001d027230_T001;
1       gramene exon    47045   47262   0       +       .       Name=Zm00001d027230_T001.e6;Parent=Zm00001d027230_T001;
1       gramene exon    47650   48111   0       +       .       Name=Zm00001d027230_T001.e7;Parent=Zm00001d027230_T001;
1       gramene exon    48200   49247   0       +       .       Name=Zm00001d027230_T001.e8;Parent=Zm00001d027230_T001;
1       gramene exon    49330   49837   0       +       .       Name=Zm00001d027230_T001.e9;Parent=Zm00001d027230_T001;

Could you please tell me what could cause the issue with an empty output refCDS.fa file? Is there something in the GFF that is not correct? Thanks.

Best,

Milica

aleksicmil commented 1 year ago

Followup:

I continued working on this today as I figured there must be something wrong with the gff file. Since I am working with corn, I tried using ftp://ftp.ensemblgenomes.org/pub/plants/release-34/gff3/zea_mays/Zea_mays.AGPv4.34.gff3.gz from https://github.com/baoxingsong/genomeAlignment/tree/main/alignb73againstmo17. The only change I did to the file was to remove the scaffolds. So now, I am able to create refCDS.fa that is not empty and the SAM file, but when I run proali using the following command:

anchorwave proali -i Zea_mays.AGPv4.34.noscaffolds.gff3 
   -r MAIZE_B73.fa
   -as refCDS.fa 
   -a MAIZE_DK105.sam 
   -ar MAIZE_B73.sam 
   -s MAIZE_DK105.fa
   -n MAIZE_DK105_MAIZE_B73.anchorspro 
   -R 1 -Q 1 -t 4 
   -o MAIZE_DK105.maf

I am getting this error:

setupAnchorsWithanchorwave_avx2: /opt/conda/conda-bld/anchorwave_1676953230719/work/src/service/TransferGffWithNucmerResult.cpp:450: void readSam(std::vector<AlignmentMatch>&, std::ifstream&, std::map<std::__cxx11::basic_string<char>, Transcript>&, int&, const double&, double&, std::set<std::__cxx11::basic_string<char> >&, int32_t&, int32_t&, int32_t&, int32_t&, int&, bool&, int&): Assertion `transcriptHashMap[elems[0]].getPStart() == chromosomePosition' failed.
/opt/miniconda/bin/anchorwave: line 22:   184 Aborted                 (core dumped) "${EXE}_avx2" "$@"

By checking some other issues on this page (Issues #2 or #16), seems to me like the issue is still in the GFF file, which as a result creates SAM file that anchorwave can't process.

Could you please share some guidelines on how I can ensure that my GFF file is ok? Or, if there are any other things that I need to ensure before running proali?

Thank you very much in advance, your help is much appreciated.

Best,

Milica

baoxingsong commented 1 year ago
  1. Which version of AnchorWave are you using, please?
  2. Where did not get the file "MAIZE_B73.fa", is that V4?
  3. Could you delete "MAIZE_DK105_MAIZE_B73.anchorspro ", rerun and then share all the commands you were using, please?