NOTE: I have transitioned to a new role and am no longer participating in the Vertebrate Genomes Project. As such, this software is no longer being maintained. For help with the pipeline or genome curation generally, please contact the Vertebrate Genome Lab at Rockefeller University.
TPF-less rapid curation of genomes
Biopython v1.81
gfastats v1.2.6
pandas
Before curating:
Curation:
Post-curation:
sh curation_2.0_pipe.sh -f <haplotype combined fasta> -a <PretextView generated agp>
-h help
-f combined haplotype fasta
-a haplotype agp generated from PretextView
Example: sh curation_2.0_pipe.sh -f rCycPin1.HiC.haps_combined.fasta -a rCycPin1.HiC.haps_combined.pretext.agp
8. Run hap2_hap1_ID_mapping.sh. This runs mashmap between your hap1 and hap2 fasta files to identify homologous pairs that are not named the same. It outputs a mashmap .out file and a tsv. The tsv, which is parsed from the .out file, lists the current name of each hap2 chromosome alongside the name of its homolog in hap1. The parsing can be confused by repetitive, similar, or small chromosomes, so I recommend plotting your mashmap or visualizing it in JBrowse to ensure the pairs in the tsv are correct. Then, to generate a hap2 fasta with updated names, pass the hap2 fasta and the tsv output to update_mapping.rb, which modifies the names and writes a new fasta. ***These two scripts were authored and kindly shared by Michael Paulini of the GRIT team at the Wellcome Sanger Institute. They are copied here for ease of access, as they make renaming hap2 chromosomes substantially easier.***
10. (Suggested) Generate a pretext map for each haplotype to ensure it was curated as anticipated.
11. Use chr_submission.py to generate the chr.tsv file that is necessary for NCBI submissions.
12. SUCCESS!
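
As a rough aid for double-checking the tsv from step 8, the sketch below summarises mashmap output per hap2 sequence. It is not the parsing used by hap2_hap1_ID_mapping.sh. It assumes the default space-delimited mashmap output (query name, length, start, end, strand, target name, length, start, end, identity) with hap2 as the query, and the function name `best_homologs` and the `SUPER_*` sequence names are hypothetical. A low fraction for the best target suggests a pair worth inspecting by eye.

```python
from collections import defaultdict


def best_homologs(mashmap_lines):
    """For each query (hap2) sequence, return (best target, fraction of aligned bases).

    Assumes default mashmap columns: query, qlen, qstart, qend, strand,
    target, tlen, tstart, tend, identity (an assumption, not checked here).
    """
    aligned = defaultdict(lambda: defaultdict(int))
    for line in mashmap_lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        query, q_start, q_end, target = fields[0], int(fields[2]), int(fields[3]), fields[5]
        # Approximate aligned query span for this mapping segment.
        aligned[query][target] += q_end - q_start

    result = {}
    for query, per_target in aligned.items():
        best = max(per_target, key=per_target.get)
        result[query] = (best, per_target[best] / sum(per_target.values()))
    return result


# Illustrative input only; these lines are made up, not real output.
example_lines = [
    "SUPER_1 1000 0 599 + SUPER_1 1100 0 599 98.5",
    "SUPER_1 1000 600 999 - SUPER_1 1100 650 1049 97.0",
    "SUPER_5 800 0 499 + SUPER_7 900 0 499 99.0",
    "SUPER_5 800 500 599 + SUPER_2 2000 100 199 91.0",
]

pairs = best_homologs(example_lines)
for hap2_name, (hap1_name, fraction) in pairs.items():
    flag = "" if fraction >= 0.9 else "  <- check this pair"
    print(f"{hap2_name}\t{hap1_name}\t{fraction:.2f}{flag}")
```

The 0.9 cut-off is arbitrary; the mashmap plot remains the authoritative check.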
## Outputs
ADD THIS
## Wishlist/operations to include
- [x] Generating the chromosome file that is necessary for NCBI submissions. Will need to be able to double check for unloc pieces.
- [ ] Another program for automatically pushing the curated files to VGP S3.
- [ ] Better way to parse multiple tags
- [ ] More flexibility in placement of unlocs
- [x] Another post-processing script to quick-align and parse the results to adjust the order and orientation of Hap_2 chromosomes to match Hap_1.
- [ ] Script for checking for curation statistics; number of breaks, joins, etc.
- [ ] More flexibility in dealing with sex chromosomes, so as to accommodate variable sex chromosome systems (e.g., XY1Y2).
- [ ] Removing haplotigs. In the current configuration they have to be painted to be removed, and the proximity ligations inserted because of painting are not removed when the haplotigs are removed.
- [ ] output haplotigs to fasta
## FAQ
1. Why won't my PretextMap open in PretextView?
> Hi-res PretextMaps will likely require an HPC to generate, and opening them in PretextView requires a discrete GPU because the map needs 16 GB of RAM (e.g., MacBooks with the M1 chip will have this capacity).
2. Why aren't my unlocalized (unloc) sequences being named correctly?
> a. At this time, the pipeline is configured to process only unlocs placed at the end of their respective chromosome assignments. Processing unlocs placed at the beginning of the painted chromosome is more complicated but possible; time permitting, I will go back and modify this in the future. For now, ***place all unlocs at the right end of their painted chromosome***. <br>
> b. The unlocs also have to be painted. Double check to make sure they have been painted along with their assigned chromosome.
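
Both unloc conditions above can be checked on the AGP before running the pipeline. The sketch below is one possible pre-flight check, not part of the pipeline itself. It assumes that the PretextView AGP follows the standard nine tab-separated AGP columns and appends tags such as `Painted` and `Unloc` as extra columns on sequence (`W`) lines; the function name `check_unlocs` and the example rows are hypothetical.

```python
def check_unlocs(agp_lines):
    """Return warnings for unlocs that are unpainted or not at the right end.

    Assumption: tags (e.g. Painted, Unloc) are stored in columns after the
    ninth AGP column on component lines of type W.
    """
    rows_by_scaffold = {}
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and AGP header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[4] != "W":
            continue  # skip gap lines and malformed rows
        tags = {tag.lower() for tag in cols[9:]}
        rows_by_scaffold.setdefault(cols[0], []).append((cols[5], tags))

    warnings = []
    for scaffold, rows in rows_by_scaffold.items():
        seen_unloc = False
        for name, tags in rows:
            is_unloc = "unloc" in tags
            if is_unloc and "painted" not in tags:
                warnings.append(f"{scaffold}: unloc {name} is not painted")
            if seen_unloc and not is_unloc:
                warnings.append(f"{scaffold}: {name} follows an unloc; move unlocs to the right end")
            seen_unloc = seen_unloc or is_unloc
    return warnings


def row(*cols):
    """Build one tab-separated AGP line from its columns."""
    return "\t".join(cols)


# Illustrative rows only: Scaffold_1 is fine, Scaffold_2 has an unloc at the
# left end, and Scaffold_3 has an unloc that was not painted.
example_agp = [
    "##agp-version 2.1",
    row("Scaffold_1", "1", "1000", "1", "W", "scaffold_1", "1", "1000", "+", "Painted"),
    row("Scaffold_1", "1001", "1100", "2", "U", "100", "scaffold", "yes", "proximity_ligation"),
    row("Scaffold_1", "1101", "1500", "3", "W", "scaffold_9", "1", "400", "+", "Painted", "Unloc"),
    row("Scaffold_2", "1", "300", "1", "W", "scaffold_4", "1", "300", "+", "Painted", "Unloc"),
    row("Scaffold_2", "301", "400", "2", "U", "100", "scaffold", "yes", "proximity_ligation"),
    row("Scaffold_2", "401", "2400", "3", "W", "scaffold_2", "1", "2000", "+", "Painted"),
    row("Scaffold_3", "1", "500", "1", "W", "scaffold_3", "1", "500", "+", "Painted"),
    row("Scaffold_3", "501", "600", "2", "U", "100", "scaffold", "yes", "proximity_ligation"),
    row("Scaffold_3", "601", "700", "3", "W", "scaffold_8", "1", "100", "-", "Unloc"),
]

unloc_warnings = check_unlocs(example_agp)
for warning in unloc_warnings:
    print(warning)
```

If your AGP stores tags differently, adjust the `cols[9:]` slice accordingly.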