Deena-B commented 5 years ago

Extract chromosome M data

Background

See DCL tutorial mitolinplan-v0 for an outline of the mitolin project plan. The plan outlines steps 1-4, which are referred to below.

Since the mitolin project focuses on mitochondrial DNA, there are a number of places in our pipeline where we can get rid of excess data. (See slides for the difference between mitochondrial & nuclear DNA.)

The first, but more complicated, place to remove excess data is the .vcf file that is generated after step 2. (See the Google Drive link at the bottom of mitolinplan-v0 to download an example .vcf file.)

The second, and more straightforward, place is the .fasta files that are generated after step 3. I suggest you tackle this issue by starting with the .fasta files.

Aim

Create a script that makes two new eg.fasta files from the two fasta example files called "chrMchr1-1.fasta.eg" & "chrMchr1-2.fasta.eg". These can be found in mitolin/data/gen/nguyen_nc_2018/20190613-fastas. Please put the new files in the same directory as the old files.

The new files should only have chrM and not chr1.

Document your work

Please fork & clone this repo. Check out a branch for your work, then push and make a PR for us to merge your note and files.

Add a note (can be .md or .ipynb) with your solution to mitolin/nb.

Your note should be named as follows:

DATE-issue#-shortdescription.ext

e.g.:

20190701-i02-extract-chrM-fa.md

Method

This data extraction from a text file can probably be done using the bash command sed or Python. Either is fine. If you want to try both or find another way that's great too. Please add notes to your documentation that give us some insight into your thinking.

See these Biostars discussion chains:
a. "Question: how to convert a long fasta-file into many separate single fasta sequences" link
b. "Question: Splitting A Fasta File" link
c. "Question: How To Split A Multiple Fasta" link
d. "Question: How To Split One Big Sequence File Into Multiple Files With Less Than 1000 Sequences In A Single File" link
e. "Question: Split Large Fasta Into Mulitple Files, Can'T Name Them With Gi Number" link
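As a starting point, here is a minimal Python sketch of the extraction. It assumes the example files are ordinary multi-record FASTA in which each record begins with a header line such as ">chrM" or ">chr1"; that has not been checked against the actual files. The function names keep_records and extract_chrom are hypothetical, not part of the repo.

```python
from pathlib import Path
from typing import Iterable, Iterator


def keep_records(lines: Iterable[str], name: str = "chrM") -> Iterator[str]:
    """Yield only the lines of FASTA records whose header ID equals `name`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Assumption: the record ID is the first whitespace-delimited
            # token after '>' (e.g. ">chrM some description").
            fields = line[1:].split()
            keep = bool(fields) and fields[0] == name
        if keep:
            yield line


def extract_chrom(src: Path, dst: Path, name: str = "chrM") -> None:
    """Copy the records named `name` from the FASTA file `src` into `dst`."""
    with src.open() as fin, dst.open("w") as fout:
        fout.writelines(keep_records(fin, name))
```

For example, extract_chrom(Path("chrMchr1-1.fasta.eg"), Path("chrM-1.fasta.eg")) would write a chrM-only copy; the output file name here is only a placeholder. Matching the whole ID rather than a prefix keeps names that merely start with "chrM" from slipping through.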

Questions?

Please put questions related to this issue in this issue thread. If you want a quick response, post a link to your comment in this thread to Slack #deepcelllineage or DM @Deena. To join Slack, enter your email address here. For questions NOT specifically related to this issue, get in touch through any of the communication methods listed in DCL's overview README.

jordwil commented 5 years ago

If we only care about chrM, you'd save some processing time by removing any non-mito chromosomes after alignment.

Ex: samtools view -hb sample.bam chrM > sampleM.bam

GATK does its own realignment, though I don't think you're going to get significant changes to the read-to-chromosome mapping in this step.
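If it helps to script this step, below is a rough Python sketch that wraps the samtools call above. Two things to keep in mind: samtools only accepts a region argument on a coordinate-sorted, indexed BAM, so the sketch runs samtools index first; and it assumes the mitochondrial contig is named chrM in the BAM header, which depends on the reference used. The helper names chrom_subset_commands and run_subset are hypothetical.

```python
import subprocess
from typing import List


def chrom_subset_commands(bam: str, out_bam: str, region: str = "chrM") -> List[List[str]]:
    """Build the samtools commands that keep only reads mapped to `region`."""
    return [
        # Region queries need an index; this assumes `bam` is coordinate-sorted.
        ["samtools", "index", bam],
        # -h keeps the header, -b writes BAM, -o names the output file.
        ["samtools", "view", "-h", "-b", "-o", out_bam, bam, region],
    ]


def run_subset(bam: str, out_bam: str, region: str = "chrM") -> None:
    """Run the commands in order; raises CalledProcessError if samtools fails."""
    for cmd in chrom_subset_commands(bam, out_bam, region):
        subprocess.run(cmd, check=True)
```

Calling run_subset("sample.bam", "sampleM.bam") should be equivalent to indexing sample.bam and then running the one-liner above, provided samtools is on the PATH.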

Deena-B commented 5 years ago

Hi Marcello, Thanks for writing a script to generate chrM .fasta files and sharing it with me. Please push it to the repo.

Given Jordan's note above, I don't think it is a good use of your time to attempt to generate a chrM .vcf file from the longer version.

So you can see what the longer version looks like, here's a link to a Google Drive folder that has .vcf files in it.

The file name for cell A10 is:

"recalibrated_duplicates_marked_reordered_sorted_filtered_realigned_Basal-1-2016-A10_CGAGGCTG-GCGTAAGA_L008_R1_001.fastq.gz.raw_variants.vcf"

decareano commented 5 years ago

Deena, thanks. The PR for extracting chrM is done and in the deepcelllineage repo.
