SJTU-CGM / TGSRICEPAN

The codes of "the Third Generation Sequencing of 111 rice genomes reveals a huge size of novel genome sequence in the RICE PAN-genome"
8 stars 3 forks source link

Extraction of PAV gene from bam #1

Open malearimond opened 9 months ago

malearimond commented 9 months ago

Hello, I am a Master Student and currently working on creating a pangenome from 21 longread accessions. A lot of my ideas on how to create my pan gene library was inspired by your work, so thanks in advance! I have already created a the gene library which now also includes novel genes that arent part of the reference previously used. I have a question to your creation of the PAV matrix: In my case I mapped long raw reads of each accession against the library and created from that with bedtools genomecov a coverage file of the genes present in the accession. But how have you been able to create a presence absence matrix of the genes in the accessions to finally calculate the core/disepensable pangenome? Again thanks for your nice work!

Best regards Male

zhixue commented 9 months ago

Hi, I am sorry to reply you late. I wrote a python script to get the coverage of genes, and it has been add in modules of EUPAN3 now. It has not been published yet, but you can use the script from https://github.com/zhixue/EUPAN3/blob/master/lib/genecov.py (Please remember to get the longest transcript of every gene named as 'xxx_pTpG.gff' before).

After that, you can merge outputs from all samples and use awk in bash shell to get PAV with long data (e.g. mRNA region coverage >=85% and CDS region coverage >=95% as the present genes; others as the absent genes).

cat *cov | grep -v '#' | awk -F '\t' '{if ($3=="mRNA") print $1,$4,$10,$14}' | awk '{if ($3>0.85 && $4>0.95) print $1,$2,1; else print $1,$2,0}'

Finally, you can use reshape2 in R to get a PAV matrix from long data to short data. Using the population or subpopulation number, you can define the genes as core or dispensable ones.

malearimond commented 9 months ago

Hello! No problem, thank you very much for your answer! I wrote an R script that gave me a similar output, thanks anyway for your script!

What I am still wondering about, and perhaps interesting for you to discuss, is the variation in sequence coverage. In order to set a final threshold to determine whether a gene is present or not, I plotted the distribution of null coverage. In addition, I subsampled my raw data and mapped it to the transcript library with half the coverage, which resulted in about 5% fewer genes present in the subsampled data due to the lower coverage, even though it was the same data set, showing that variance in coverage leads to different gene PAV. I am thinking of subsampling all my data sets to the minimum coverage so that the bias is not too large, but if you want to discuss this topic, how did you deal with the coverage variation?

zhixue commented 9 months ago

It is a good but open question. The gene PAV coverage threshold I used is same as study before, in order to compare differnet sequencing technologies. It is necessary to adjust the threshold according to different cases (e.g. low sequencing depth less than 20x).

I also tested different sequencing platform and sequencing depth from a rice accession IR64 (because I can download several IR64 WGS data sets from public databases), and I found it has some impact on gene PAVs but little impact on gene family PAVs (Present gene family: at least one present gene member). Similiar missing situation also happens in SNP calling or other variation calling.

Subsampling is a good idea, but it depends on the lowest coverage of all samples. So filtering out the samples with low sequncing depth is necessary.