PombertLab / SYNY

The SYNY pipeline investigates synteny between species by reconstructing protein clusters from gene pairs.
MIT License
29 stars 4 forks source link

Minimap are not generating the proper figures #4

Closed sa-andre closed 3 months ago

sa-andre commented 3 months ago

Hello, I was trying the SYNY using the configs you developed before (including just a few of the scaffolds from the genome), so my command was

run_syny.pl -a *.gz -outdir SUBSET_minimap --aligner minimap -no_clus --no_circos --include names.txt

This results in blank images such as GCA_015220715_vs_GCA_904425465 mmap barplot 19 2x10 8 Spectral GCA_015220715_vs_GCA_904425465 mmap 1e5 19 2x10 8 blue

When I run using mashmap it works fine ( run_syny.pl -a *.gbff.gz -outdir SUBSET --aligner mashmap -no_clus --no_circos --include names.txt)

minimap seems to be installed and working, so I am not sure what is happening.

The error file is in attachment. error.log

If I may also ask, is it possible to further reduce the analysis to a particular range in each scaffold? I am finding it difficult to visualize similarities using the whole scaffolds, since I am trying to link synteny to a particular position (was planing to analyze like 2mb around the genes i am studing, instead of the whole 30+mb of the whole scaffold.

Pombert-JF commented 3 months ago

paf_metrics.py => ValueError: max() iterable argument is empty and the blank plots lead me to think that your PAF files are empty. Mashmap3 runs in a much smaller memory footprint than minimap2. Can you check the content of the PAF files generated by minimap2? If they are blank it most likely means that minimap2 ran out of memory. When I ran it on your dataset it peaked at about 10 Gb of RAM.

I'm thinking that adding a check for blank PAF files would be useful...

Selecting a subrange from sequences should be simple enough to implement with an input file like below. Would also help reduce minimap2 RAM usage. Will look into it tomorrow. If you send me the coordinates you want (like in the template like below), I'll test it on your dataset.

## Sequence start   end
sequence_1  1000000 3000000
sequence_2  1500000 3000000
sequence_3  5000000 7000000
sa-andre commented 3 months ago

Regarding the PAF files, one of them is empty (GCA_01... vs GCA_90...) while the other seems to have a result (GCA_90... vs GCA_01). I am currently running on my personal computer since my university have no cluster/dedicated server for analysis. But I guess you are probably right, my computer can't process this dataset (it has like 4GB RAM). Hopefully it can run using the range I am sending in attachment.

Thanks a lot! I think your package will help me in my research, even if I can only use mashmap. Hopefully one day it will be implemented on Galaxy web server, for those of us who doesnt have servers at disposal.

May I also ask which file would I need to run the protein cluster analysis? I also tried wihtout the "-no_clus" case but during the run it says there is no protein clusters and the Clusters/Diamond and Cluster/Synteny directories are empty ranges.txt

Pombert-JF commented 3 months ago

I implemented a new option --ranges to select contigs with the desired subranges, as discussed. It does reduce your dataset size substantially, which helps reduce minimap2 RAM usage. Not sure if it will run within 4Gb of RAM (OS + minimap) but the peak minimap usage was around 3Gb.

## Full dataset
LIST_NORMAL/SEQUENCES/GENOMES: total 2.4G
1.2G GCA_015220715.fasta
1.2G GCA_904425465.fasta

## Selected contigs with --include
LIST_INCLUDE/SEQUENCES/GENOMES: total 352M
149M GCA_015220715.fasta
203M GCA_904425465.fasta

## Selected contigs and subrange with --ranges
LIST_RANGES/SEQUENCES/GENOMES: total 28M
12M GCA_015220715.fasta
16M GCA_904425465.fasta

Using the following command on your dataset run_syny.pl -a *.gbff.gz -out SUBRANGES --ranges ranges.txt generates the dotplot (below) from minimap2 alignments (the horizontal line for CAJB..549 is a clear mapping artefact).

GCA_015220715_vs_GCA_904425465 mmap 1e5 19 2x10 8 blue

Pombert-JF commented 3 months ago

I implemented a check for blank annotations. You'll get a message like below in the shell when using GenBank files without annotations.

Protein file /home/jpombert/TESTS/SUBRANGES/SEQUENCES/PROTEINS/GCA_904425465.faa is blank
Protein file /home/jpombert/TESTS/SUBRANGES/SEQUENCES/PROTEINS/GCA_015220715.faa is blank

[E] All protein files are blank. Skipping cluster detection...

That information will also be included in syny.log. Blank PAF files (if generated) will also be inscribed in the log.

################################################################################################
##### Parsing data
################################################################################################

list_maker.pl:                                               started on Wed May 22 13:14:42 2024
  - GCA_015220715.1_fPygNat1.pri_genomic.gbff.gz does not contain protein annotations.
  - GCA_904425465.1_Colossoma_macropomum_genomic.gbff.gz does not contain protein annotations.
Runtime: 54 seconds                                        completed on Wed May 22 13:15:36 2024

You no longer need to skip gene cluster inferences with --no_clus if there is no gene annotation, syny will skip gene cluster inferences automatically.

To get result from gene cluster inferences, you need to provide GenBank files containing annotations. The files you used from NCBI do not contain any gene feature. e.g.:

zcat GCA_015220715.1_fPygNat1.pri_genomic.gbff.gz | head -n 130

LOCUS       CM026721            57280940 bp    DNA     linear   CON 03-NOV-2020
DEFINITION  Pygocentrus nattereri isolate fPygNat1 chromosome 1, whole genome
            shotgun sequence.
ACCESSION   CM026721 JADFAU010000000
VERSION     CM026721.1
DBLINK      BioProject: PRJNA561975
            BioSample: SAMN12623623
KEYWORDS    WGS.
SOURCE      Pygocentrus nattereri (red-bellied piranha)
  ORGANISM  Pygocentrus nattereri
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Actinopterygii; Neopterygii; Teleostei; Ostariophysi;
            Characiformes; Characoidei; Pygocentrus.
REFERENCE   1  (bases 1 to 57280940)
  AUTHORS   Myers,G., Meyer,A., Karagic,N., Pippel,M., Winkler,S., Tracey,A.,
            Wood,J., Formenti,G., Howe,K., Fedrigo,O. and Jarvis,E.D.
  TITLE     Pygocentrus nattereri (red-bellied piranha) genome, fPygNat1,
            primary haplotype
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 57280940)
  AUTHORS   Myers,G., Meyer,A., Karagic,N., Pippel,M., Winkler,S., Tracey,A.,
            Wood,J., Formenti,G., Howe,K., Fedrigo,O. and Jarvis,E.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (28-OCT-2020) Vertebrate Genomes Project, G10K, 1230 York
            Avenue, New York, NY 10065, USA
COMMENT     ##Genome-Assembly-Data-START##
            Assembly Date          :: 13-MAR-2020
            Assembly Method        :: FALCON v. 2018.31.08-03.06; FALCON-Unzip
                                      v. 7.0.1.66975; purge_dups v. v1;
                                      scaff10x v. 4.1.0; Bionano Solve DLS v.
                                      3.4; Salsa HiC v. 2.2; Arrow smrtanalysis
                                      Pacbio polishing & gap filling v.
                                      SMRTLink7.0.1; longranger align v. 2.2.2;
                                      freebayes Illumina polishing v. 1.3.1;
                                      gEVAL manual curation v. 2020-03-13; VGP
                                      assembly pipeline individual v. 1.6
            Assembly Name          :: fPygNat1.pri
            Diploid                :: Principal pseudohaplotype
            Genome Representation  :: Full
            Expected Final Version :: No
            Genome Coverage        :: 77.77x
            Sequencing Technology  :: PacBio Sequel I CLR; llumina NovaSeq;
                                      Arima Genomics Hi-C; Bionano Genomics DLS
            ##Genome-Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..57280940
                     /organism="Pygocentrus nattereri"
                     /mol_type="genomic DNA"
                     /isolate="fPygNat1"
                     /db_xref="taxon:42514"
                     /chromosome="1"
                     /sex="male"
                     /tissue_type="liver"
                     /dev_stage="adult"
                     /country="Netherlands: Dejong Marinelife, Rotterdam"
                     /lat_lon="51.9244201 N 4.4777325 E"
                     /collection_date="21-Nov-2018"
                     /collected_by="Nidal Karagic"
     assembly_gap    3508951..3521614
                     /estimated_length=12664
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    10702314..10702413
                     /estimated_length=100
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    14855534..14855633
                     /estimated_length=100
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    22960637..22960836
                     /estimated_length=200
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    29746050..29746249
                     /estimated_length=200
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    34642186..34645152
                     /estimated_length=2967
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    41914221..41925849
                     /estimated_length=11629
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    41927829..41927928
                     /estimated_length=100
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    44318467..44318479
                     /estimated_length=13
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    45116458..45246249
                     /estimated_length=129792
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    45714387..45714586
                     /estimated_length=200
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    46011585..46011784
                     /estimated_length=200
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    46086850..46087049
                     /estimated_length=200
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    46439866..46439878
                     /estimated_length=13
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    46447772..46447871
                     /estimated_length=100
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
     assembly_gap    46641104..46641203
                     /estimated_length=100
                     /gap_type="within scaffold"
                     /linkage_evidence="map"
CONTIG      join(JADFAU010000015.1:1..57280940)
ORIGIN
        1 accctaaccc taaccctaac cctaacccta accctaaccc taaccctaac cctaacccta
       61 accctaaccc taaccctaac cctaacccta accctaaccc taaccctaac cctaacccta
      121 accctaaccc taaccctaac cctaacccta accctaaccc taaccctaac cctaacccta
      181 accctaaccc taaccctaac cctaacccta accctaaccc taaccctaac cctaacccta
      241 accctaaccc taaccctaac cctaacccta accctaaccc taaccctaac cctaacccta
      301 accctaaccc taaccctaac cctaacccta accctaaccc taaccctaac cctaacccta

What you want is a GenBank file that contains something like this:

     gene            <35861..>36478
                     /locus_tag="J0A71_11g22960"
     mRNA            <35861..>36478
                     /locus_tag="J0A71_11g22960"
                     /product="dihydrofolate reductase"
     CDS             35861..36478
                     /locus_tag="J0A71_11g22960"
                     /codon_start=1
                     /product="dihydrofolate reductase"
                     /protein_id="UYI28385.1"
                     /translation="MLALVVALASHRGIGNANALPWPRPLAADMAWFRTLSQSIPLIS
                     PDRIALAPSASNAVVMGRRTWDSIPSRFRPLANRINVVLSRGPARSTENTFFIQTFEA
                     LDSLPLPPSSMTFVIGGRDVYSLALQSGRPHLIFATEVFESPECDVFFPHIDWASYEK
                     RDITRDVSRLIDRTLASAFYSPETATFTENGTSFKMFLYTKPETR"
     gene            <36514..>37398
                     /locus_tag="J0A71_11g22970"
     mRNA            <36514..>37398
                     /locus_tag="J0A71_11g22970"
                     /product="thymidylate synthase"
     CDS             36514..37398
                     /locus_tag="J0A71_11g22970"
                     /codon_start=1
                     /product="thymidylate synthase"
                     /protein_id="UYI28386.1"
                     /translation="MPQDPRHPEHQYLDLVKHILENGARRMDRTGTGTLSVFGATMRF
                     SLEDNTFPLLTTRRVFYRGVVEELLFFLRGETDSKVLEKKGVRIWEKNGAKQFLQSVG
                     IDREEGDLGPIYGFQWRHFGARYETSASSYEGKGVDQIASAIAAIRANPASRRIVVSA
                     WNPTDLGSMALPPCHVLFQFNVTDGKLSCAMYQRSGDMGLGVPFNIASYSLLTILVAH
                     LTGLQPGEFVHFLGDAHVYLDHVDSLRQQVQRPPRAFPKLFVSPKGPRTEPEHFQYED
                     FELVGYDPHPAIKMNMSA"
     gene            <37449..>38831
                     /locus_tag="J0A71_11g22980"
     mRNA            <37449..>38831
                     /locus_tag="J0A71_11g22980"
                     /product="serine-glycine hydroxymethyltransferase"
     CDS             37449..38831
                     /locus_tag="J0A71_11g22980"
                     /codon_start=1
                     /product="serine-glycine hydroxymethyltransferase"
                     /protein_id="UYI28387.1"
                     /translation="MTDAREKGFWTGPLEMADPELHALICGEVERQKKTINLIASENY
                     AHQSAMEACGSVLTNKYSEGRVGERYYGGTHWVDRIELLCQKRALELFGLDPDVWGVN
                     VQPYSGSPANFAIYTAVVPPGGRIMGLDLPSGGHLTHGYKTKTRKISASSVYFDSRPY
                     TVGSNGLIDYEGLEKTFTDFLPHILICGYSAYSRDIDYKRLQSIAGRNGAFLFADISH
                     ISPLVASGLMNSPFEHCDIVMTTTQKGLRGPRGALIFYRRAVAKNGETVDLDARINFA
                     VFPMLQGGPHNHTIAGIASALLHAGTPEFAEYTRRVVENSRELCSRLQSLGLDILTGG
                     TDNHMLLVDLRSTGVDGAAVEHMCDALGISLNRNAIVGDSSPLSPSGIRVGTYAVTAR
                     GFGPEEMREVGDIIGGVVKLCREMTGGRKMSKADLHKVTSDARVMGSEQVLVLRRRVC
                     ALAEAYPIYE"
Pombert-JF commented 3 months ago

Looks like the files you want are located here under these URLs: https://www.ncbi.nlm.nih.gov/assembly/GCF_904425465.1 # Colossoma macropomum (tambaqui) https://www.ncbi.nlm.nih.gov/assembly/GCF_015220715.1 # Pygocentrus nattereri (red-bellied piranha

To download the GBFF files containing the annotated genomes.

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/904/425/465/GCF_904425465.1_Colossoma_macropomum/GCF_904425465.1_Colossoma_macropomum_genomic.gbff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/220/715/GCF_015220715.1_fPygNat1.pri/GCF_015220715.1_fPygNat1.pri_genomic.gbff.gz

Oddly enough, locus tags are absent from these annotations. Uniques IDs are instead labelled by Gene IDs. Modified list_maker.pl accordingly so that gene IDs are used as tags if locus tags are absent. Note that only the first isoform is kept if many are found.

You'd have to regenerate your ranges.txt if the contig names have changed. Might want to use different gap thresholds (e.g. --gaps 0 1 5) as well.

Pombert-JF commented 3 months ago

Added a section about memory usage to the readme explaining RAM issues with large genomes. Should help combined with the above changes. Will close this issue as resolved but let me know if you encounter issue(s).