COMBINE-lab / RapMap

Rapid sensitive and accurate read mapping via quasi-mapping
GNU General Public License v3.0

Remove scaffolds? #22

Closed CTLife closed 7 years ago

CTLife commented 8 years ago

Hi, I downloaded reference genomes from Ensembl (FASTA format), but many of the sequences are labeled "dna:scaffold": https://github.com/CTLife/TEMP/tree/master/RefGenomes

For example, in Mouse_GRCm38 (mm10), should everything except chromosomes 1-19, Mt, X, and Y be removed before mapping? https://github.com/CTLife/TEMP/blob/master/RefGenomes/Mouse_GRCm38.p4.txt

Similarly, Human_GRCh38.p5 (hg38) has 516 sequences: https://github.com/CTLife/TEMP/blob/master/RefGenomes/Human_GRCh38.p5.txt. Should everything other than chromosomes 1-22, Mt, X, and Y (such as CHR_HG2241_PATCH and KI270728.1) be removed before mapping?

mdshw5 commented 8 years ago

If you're trying to map to a whole genome, I would point you to COMBINE-lab/salmon#49 or the README (TL;DR: RapMap is not for genomic alignments). If you have reads from cDNA, hybrid capture, or amplicons, I would suggest making a GTF or GFF feature file from your transcripts, capture regions, or target amplicons, and then using gffread to make a reduced FASTA. If you have a BED file with your regions of interest, you could also use mdshw5/pyfaidx, since it includes a CLI script to subset and filter FASTA files.

To answer your original question: if the features you are sequencing are not placed on the larger, chromosome-sized contigs, you'll want to keep the scaffold sequences when you subset the FASTA file, before indexing with RapMap.
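
To make the "keep only what your features sit on" idea concrete, here is a minimal sketch in plain Python. It is not how gffread or pyfaidx work internally, and it only drops whole sequences rather than extracting feature regions. The function names (`seqnames_in_gtf`, `filter_fasta`) and the toy records (`scaffold_A`, `scaffold_B`) are made up for illustration; it assumes the first whitespace-delimited token of each FASTA header is the sequence name and that it matches column 1 of the GTF.

```python
import io


def seqnames_in_gtf(gtf_lines):
    """Collect sequence names (GTF column 1) that carry at least one feature."""
    names = set()
    for line in gtf_lines:
        # Skip blank lines and comment/pragma lines.
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0].strip())
    return names


def filter_fasta(fasta_lines, keep):
    """Yield the lines of every FASTA record whose name is in `keep`."""
    writing = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumed header layout: ">NAME description..."
            parts = line[1:].split()
            name = parts[0] if parts else ""
            writing = name in keep
        if writing:
            yield line


# Toy data (hypothetical): one chromosome and two scaffolds,
# with features annotated on the chromosome and on scaffold_A only.
toy_fasta = io.StringIO(
    ">1 dna:chromosome\nACGT\nACGT\n"
    ">scaffold_A dna:scaffold\nGGCC\n"
    ">scaffold_B dna:scaffold\nTTAA\n"
)
toy_gtf = io.StringIO(
    "#!toy annotation\n"
    '1\ttoy\texon\t1\t8\t.\t+\t.\tgene_id "g1";\n'
    'scaffold_A\ttoy\texon\t1\t4\t.\t-\t.\tgene_id "g2";\n'
)

keep = seqnames_in_gtf(toy_gtf)
reduced = "".join(filter_fasta(toy_fasta, keep))
print(reduced, end="")
```

With real files you would pass open file handles instead of the `io.StringIO` objects. A scaffold survives only if the annotation places a feature on it, so unannotated scaffolds and patches drop out without a hand-written list of chromosome names.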

CTLife commented 8 years ago

OK, thank you.