Rcount-format with bed file

MingL196 commented 8 years ago

Hi!

I tried using your Rcount-format with a bed file downloaded from UCSC.

Although the bed file had multiple chromosomes in it, it looks like your Rcount-format only processed one chromosome and then finished.

Also, when I limited the bed file to chr10, only 972 of the 1600 unique NM_ / NR_ numbers ended up in the output xml file.

Do you have any idea why this happened?

Thanks.

MWSchmid commented 8 years ago

Hi Ming

Could you please send me a link to the specific bed file you used (I would like to test and fix it)?

Best regards

MingL196 commented 8 years ago

Since this area only supports .txt, I added .txt to the end of each bed file.

mm10_chr10.bed.txt
mm10_chr1_2.bed.txt

MingL196 commented 8 years ago

The mm10_chr10.bed.txt file was obtained from UCSC's Table Browser (table refGene, limited to chr10, output in bed format). It was used to produce the numbers above (972 of the 1600 unique NM_ / NR_ numbers ended up in the xml output).

The mm10_chr1_2.bed.txt file was made by concatenating two Table Browser exports (table refGene, chr1 and table refGene, chr2). Running it through Rcount-format returned only chr1 features.

MingL196 commented 8 years ago

Lastly, I am using the linux64bit.zip compiled binary version of your program.

MWSchmid commented 8 years ago

Hi Ming

OK, thanks, I tested it. The problem is the non-unique model names in the fourth column: the formatter stops as soon as it encounters the same model name a second time, which is why the rest is missing. A workaround is to append a dot and a number to the names (e.g., NM_102842.1 and NM_102842.2). The part before the dot is then used as the gene name and the full name as the transcript name. I attached a Python script that does this (Python 2.7; I had to zip it because .py files cannot be uploaded):

addDotNumberToBed.py.zip

python addDotNumberToBed.py mm10_chr10.bed > mm10_chr10_withNum.bed
python addDotNumberToBed.py mm10_chr1_2.bed > mm10_chr1_2_withNum.bed
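For readers without the attachment, the renaming step could look roughly like the sketch below. This is not the attached addDotNumberToBed.py, whose contents are not shown in this thread; it is an illustration written for Python 3. It assumes a tab-separated bed file with the model name in the fourth column. The function name `add_dot_numbers` and the choice to number every name (including ones that occur only once) are assumptions made for this example.

```python
from collections import defaultdict


def add_dot_numbers(lines):
    """Yield bed lines with '.<n>' appended to the name in column 4.

    Hypothetical helper: n counts how often that name has been seen so far,
    so every name in the output is unique.
    """
    seen = defaultdict(int)  # name -> number of occurrences so far
    for line in lines:
        stripped = line.rstrip("\n")
        fields = stripped.split("\t")
        # Pass through anything that is not a feature line with a name column
        # (blank lines, "track"/"browser" headers, comments).
        if len(fields) < 4 or stripped.startswith(("track", "browser", "#")):
            yield stripped
            continue
        name = fields[3]
        seen[name] += 1
        fields[3] = "{}.{}".format(name, seen[name])
        yield "\t".join(fields)
```

To use it on a file, iterate over an open file handle and print each yielded line, for example `for out in add_dot_numbers(open("mm10_chr10.bed")): print(out)`.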

If you use the new files, the number of top-level features should be correct: 1359 genes and 241 non-coding genes with the chr10 file (1600 in total), and 4350 genes and 684 non-coding genes with the chr1 and chr2 file (5034 in total).
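One way to sanity-check such totals against a renamed bed file is to count the unique names before the dot, split by RefSeq prefix (NM_ for mRNA, NR_ for non-coding RNA). The sketch below does only that. Whether Rcount-format itself classifies genes by prefix is not stated in this thread, so treat the split as an assumption; `count_top_level_names` is a hypothetical helper name.

```python
def count_top_level_names(lines):
    """Return (n_NM, n_NR): unique column-4 names before the dot, by prefix.

    Assumes tab-separated bed lines; lines with fewer than four
    tab-separated fields (e.g. headers) are skipped.
    """
    coding, noncoding = set(), set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        gene = fields[3].split(".", 1)[0]  # drop the '.<n>' suffix if present
        if gene.startswith("NM_"):
            coding.add(gene)
        elif gene.startswith("NR_"):
            noncoding.add(gene)
    return len(coding), len(noncoding)
```

Calling `count_top_level_names(open("mm10_chr10_withNum.bed"))` returns the two counts as a tuple, which can then be compared with the numbers the formatter reports.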

Let me know if you still encounter problems.

Best regards

MingL196 commented 8 years ago

It works! Thanks.

MWSchmid commented 8 years ago

Perfect - thanks for the report.

Best regards

MWSchmid / Rcount

Rcount-format with bed file #1