Open ekarlins opened 7 years ago
Maybe when we output the bed file, we can just add ‘chr’ in front of the numbers? It’s a one-liner at the end of the R script. I don’t see how this could mess things up. We could try that?
From: Eric Karlins notifications@github.com Reply-To: NCBI-Hackathons/Scan2CNV reply@reply.github.com Date: Friday, March 24, 2017 at 2:32 PM To: NCBI-Hackathons/Scan2CNV Scan2CNV@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [NCBI-Hackathons/Scan2CNV] add "chr" to chromosome names in gsrc bed file (#30)
For comparing gsrc and PennCNV bed files it would make sense to have the chromosomes named the same way. Currently PennCNV uses "chr" as part of the naming and gsrc does not. "chr" is required for UCSC browser and I don't think we can (or want to) change this for PennCNV.
It should be easy to add "chr" to the gsrc output. I'm just wondering if there are any use cases where this would be bad. For example, are there any species with SNP chips available whose chromosome names would become wrong if "chr" were added to the start? I know some species don't stick with the numbering convention used in the human genome.
It could mess things up:
1) If "chr" is already there, you'll get "chrchr3" instead of "chr3".
2) Some species have unusual chromosome names. "JH564832" is the name of a chromosome in armadillo, and I don't think adding "chr" before it would be appropriate.
Scan2CNV does not have to be specific to just the conventions of the human genome. That being said, I don't know all of the species that have SNP arrays that could be used for it.
I should also add that the chromosome names in the txt files made by "scripts/gtc2PennCNV.py" (the input files for the R script) are the names from the Illumina manifest. I don't know if these are consistent across all manifests for one species.
You’re right.
Maybe we can check for a consistent naming convention before using bedtools?
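A check like that could be sketched in Python as below. This is only an illustration: it assumes the bed files are whitespace-delimited with the chromosome in the first column, and the function name `chrom_naming` is made up here, not something that exists in Scan2CNV.

```python
def chrom_naming(bed_lines):
    """Classify the chromosome naming used in an iterable of BED lines.

    Returns "chr" if every chromosome name starts with "chr",
    "no_chr" if none do, "mixed" if both styles occur,
    and "empty" if there are no data lines.
    """
    with_prefix = 0
    without_prefix = 0
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the header lines a BED file may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split()[0]
        if chrom.startswith("chr"):
            with_prefix += 1
        else:
            without_prefix += 1
    if with_prefix == 0 and without_prefix == 0:
        return "empty"
    if with_prefix and without_prefix:
        return "mixed"
    return "chr" if with_prefix else "no_chr"


# Made-up example lines, not real gsrc or PennCNV output.
print(chrom_naming(["1\t100\t200", "X\t5\t50"]))        # no_chr
print(chrom_naming(["chr1\t100\t200", "chrX\t5\t50"]))  # chr
print(chrom_naming(["chr1\t100\t200", "2\t5\t50"]))     # mixed
```

If the two files come back as "chr" and "no_chr", or either one is "mixed", that would be the signal to fix the names before handing them to bedtools.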
Yes. Maybe we should confirm how PennCNV is doing it. It could be as simple as checking if chrom starts with "chr" and if not adding it.
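As a rough sketch of that check in Python (the function name `add_chr_prefix` is hypothetical, and this is not confirmed to be what PennCNV does):

```python
def add_chr_prefix(chrom):
    """Prepend "chr" unless the name already starts with it."""
    chrom = str(chrom)
    return chrom if chrom.startswith("chr") else "chr" + chrom


print(add_chr_prefix("3"))         # chr3
print(add_chr_prefix("chr3"))      # chr3
print(add_chr_prefix("JH564832"))  # chrJH564832
```

This avoids the "chrchr3" problem, but it still prefixes names like "JH564832", so the concern about species with unusual chromosome names would need a separate decision.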
Since we are only using Illumina data for now, I think we should only worry about this for the Illumina SNP chips that are available. We should check the chromosome names for all SNP chips from Illumina. I think we can skip the human ones for now, since I'm pretty sure they will work with this convention (check if chrom starts with "chr"; if not, add it). It would be good to check the other species with SNP chips available, though.
The best way is probably downloading the csv file for the SNP chip manifest and checking the set of chromosomes in it. For the GSA array we can do that like this:
tail -n +9 GSAMD-24v1-0_20011747_A1.csv | cut -f10 -d "," | sort | uniq
0
1
10
11
12
13
14
15
16
17
18
19
2
20
21
22
3
4
5
6
7
8
9
[Controls]  ## this is the last line of the csv file and is not a chrom name that will end up in our data
MT
X
XY
Y
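To run the same check across several manifests, the shell command could be mirrored in Python roughly like this. It is a sketch under the same assumptions as the command above (data starts on line 9 and the chromosome is the 10th comma-separated field); the function name `manifest_chromosomes` and the tiny fake manifest are made up for illustration, and other manifests may be laid out differently.

```python
import csv
import io


def manifest_chromosomes(handle, skip_lines=8, chrom_col=9):
    """Return the sorted set of chromosome names in a manifest CSV.

    skip_lines=8 and chrom_col=9 (zero-based) mirror `tail -n +9` and
    `cut -f10` in the shell command; both are assumptions about the layout.
    """
    for _ in range(skip_lines):
        next(handle, None)
    chroms = set()
    for row in csv.reader(handle):
        # Short rows such as the trailing "[Controls]" line are skipped.
        if len(row) > chrom_col:
            chroms.add(row[chrom_col])
    return sorted(chroms)


# Fake manifest: 8 header lines, 4 data rows, then the "[Controls]" line.
fake = io.StringIO(
    "\n".join(
        [f"header {i}" for i in range(8)]
        + [",".join(["."] * 9 + [c, "100"]) for c in ["1", "X", "1", "MT"]]
        + ["[Controls]"]
    )
)
print(manifest_chromosomes(fake))  # ['1', 'MT', 'X']
```

With a real manifest, the file would be opened with `open(path, newline="")` and passed in the same way. Unlike the `cut` version, this drops the "[Controls]" line instead of printing it.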