Teichlab / COVID-19paed


apply for reference data #1

Open lwlive opened 1 year ago

lwlive commented 1 year ago

Hi, I recently got access to dataset EGAD00001007718 from EGA, but all the data are in CRAM format. To convert the data to BAM format, I need the reference genome. I followed the instructions in the Methods section and downloaded the human and virus genomes, but the conversion failed, so it seems the genomes are not the correct ones. I downloaded the human genome from http://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz. Is that the genome version you used in the paper? If not, could you please point me to the correct genome? As for the virus genomes, I downloaded them as well, but the conversion still failed. Could you share the virus genomes directly, so that I can convert the CRAM files to BAM? I look forward to your help. Thanks!

Wei Liu liuwei3@sysucc.org.cn

ktpolanski commented 1 year ago

The CRAMs that were uploaded to EGA are not aligned - they're just a space-efficient form of unaligned read storage.

You should convert the CRAMs to FASTQs, and then you're free to align them as you see fit. This is the command I'd use locally to perform this task for a given $CRAMFILE:

samtools fastq -1 R1.fastq.gz -2 R2.fastq.gz --i1 I1.fastq.gz -n -i --index-format i8 $CRAMFILE
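
For a whole download directory, the same command can be generated once per CRAM file. The following is a minimal sketch, not part of the original reply: the directory layout and the one-sub-directory-per-CRAM naming scheme are assumptions made for illustration, and only the samtools flags are taken from the command above.

```python
import shlex
from pathlib import Path


def cram_to_fastq_command(cram_path, out_dir):
    """Build the samtools fastq command from this thread for one CRAM file."""
    cram = Path(cram_path)
    # Hypothetical naming scheme: one sub-directory per CRAM, named after the file stem.
    sample_dir = Path(out_dir) / cram.stem
    return [
        "samtools", "fastq",
        "-1", str(sample_dir / "R1.fastq.gz"),
        "-2", str(sample_dir / "R2.fastq.gz"),
        "--i1", str(sample_dir / "I1.fastq.gz"),
        "-n", "-i", "--index-format", "i8",
        str(cram),
    ]


def commands_for_directory(cram_dir, out_dir):
    """Return one command per *.cram file found in cram_dir, sorted by name."""
    return [cram_to_fastq_command(p, out_dir) for p in sorted(Path(cram_dir).glob("*.cram"))]


def as_shell(command):
    """Render a command list as a shell-quoted string for review or logging."""
    return shlex.join(command)
```

Printing `as_shell(cmd)` for each command makes it possible to review them before running anything, for example with `subprocess.run(cmd, check=True)`. The per-sample output directories need to exist before samtools writes to them.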

lwlive commented 1 year ago

Thanks for your kind and quick reply! I will give your suggestion a try. Thanks again!

Best regards! Wei Liu

lwlive commented 1 year ago

Hi ktpolanski, we have downloaded dataset EGAD00001007718 from EGA and successfully converted it to FASTQ format with your command. Thanks! However, the GEX and V(D)J data from healthy PBMCs, which are what we need, were generated with Feature Barcoding technology. Could you please share the Feature Barcoding information so that we can demultiplex the data? If possible, could you send it to my work email (liuw329@mail.sysu.edu.cn or liuwei3@sysucc.org.cn)?

Wei Liu liuwei3@sysucc.org.cn State Key Laboratory of Oncology in South China Sun Yat-sen University Cancer Center Guangzhou 510060, China

ktpolanski commented 1 year ago

We passed the following file as the --feature-ref in cellranger.

citeseq.csv
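
A file passed to `--feature-ref` follows the Cell Ranger feature reference layout, with the columns `id`, `name`, `read`, `pattern`, `sequence` and `feature_type`. As a rough sketch of how to check what kind of features such a file describes, the snippet below groups feature IDs by `feature_type`. The two rows are hypothetical placeholders in that layout, including the `Antibody Capture` type and the sequences. They are not the contents of the attached citeseq.csv.

```python
import csv
import io

# Hypothetical rows in the Cell Ranger feature reference layout. The real
# citeseq.csv attached to this thread is not reproduced here.
EXAMPLE_FEATURE_REF = """\
id,name,read,pattern,sequence,feature_type
AB_CD3,AB_CD3,R2,5PNNNNNNNNNN(BC),AACCGGTTAACCGGT,Antibody Capture
AB_CD19,AB_CD19,R2,5PNNNNNNNNNN(BC),TTGGCCAATTGGCCA,Antibody Capture
"""


def summarise_feature_ref(handle):
    """Group feature ids by their feature_type column."""
    by_type = {}
    for row in csv.DictReader(handle):
        by_type.setdefault(row["feature_type"], []).append(row["id"])
    return by_type


summary = summarise_feature_ref(io.StringIO(EXAMPLE_FEATURE_REF))
print(summary)
```

For a real file, replace the `io.StringIO` object with `open("citeseq.csv", newline="")`.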

lwlive commented 1 year ago

Thanks for your quick response and for the information. But I am a little confused. For example, I know that C1-PBMC is a mix of four donors: NP16, NP41, NP15 and NP20. I assumed each donor corresponds to an oligonucleotide sequence, but the file contains no donor ID information. What do the first two columns, "id" and "name" (of the form "AB_" + gene symbol), represent? Could you please provide the mapping between donor IDs and oligonucleotide sequences?

Thanks! Best regards! Wei Liu

ktpolanski commented 1 year ago

I'd like to direct you to "Demultiplexing and doublet removal of PBMC samples" in the manuscript.

lwlive commented 1 year ago

Hi ktpolanski, thanks for your advice. I will read it carefully and try to figure it out. Thanks again!

lwlive commented 1 year ago

Hi ktpolanski, sorry to bother you again! I noticed that the PBMC samples were demultiplexed with souporcell and that a set of known donor genotypes was provided. I know that souporcell can demultiplex the mixed data, but on its own it cannot tell which cluster corresponds to which donor. The known genotypes are therefore very important. How can I get the donors' genotypes?

lwlive commented 1 year ago

Hi ktpolanski! I have been trying to work out how to demultiplex the PBMC samples, but I am still confused. Could you give me a hand? As far as I know, there are two ways to separate mixed samples. One is Feature Barcoding, which requires the mapping between sample names and oligonucleotide sequences. The other is based on the genotypes of the individuals (souporcell), which requires known genotypes.
For G1-PBMC (NP32, NP31, PC11, AP7-post), I only need the data from donor NP32. In citeseq.csv, the oligonucleotides correspond to names that start with "AB", and NP32 is not among them. This means the Feature Barcoding approach cannot be used. Could you please share more details about the method and data you used to demultiplex the mixed samples? This is really important for us!

Thank you! Best regards! Wei Liu

ktpolanski commented 1 year ago

The known genotypes are not necessary if you plan your experiments accordingly. Let's assume for simplicity's sake that you've got two donors, A and B. You have one sample that's donor A only, one sample that's donor B only, and then some number of samples that are a mix of the two. You possess sufficient information to correctly identify the donors without additional genotyping.
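
The two-donor reasoning above can be sketched as a genotype concordance check: each cluster from the mixed sample is assigned to the single-donor sample whose genotype calls it agrees with most often. This is a toy illustration only. The sites and calls are made up, the function names are hypothetical, and it is not the procedure used in the study.

```python
def concordance(genotypes_a, genotypes_b):
    """Fraction of sites genotyped in both inputs where the calls agree."""
    shared = [site for site in genotypes_a if site in genotypes_b]
    if not shared:
        return 0.0
    matches = sum(genotypes_a[site] == genotypes_b[site] for site in shared)
    return matches / len(shared)


def assign_clusters(cluster_genotypes, reference_genotypes):
    """Map each cluster to the reference donor with the highest concordance."""
    assignment = {}
    for cluster, calls in cluster_genotypes.items():
        scores = {donor: concordance(calls, ref) for donor, ref in reference_genotypes.items()}
        assignment[cluster] = max(scores, key=scores.get)
    return assignment


# Toy data (hypothetical sites and calls): genotypes from the single-donor samples...
reference = {
    "A": {"chr1:100": "0/0", "chr1:200": "0/1", "chr2:50": "1/1", "chr3:10": "0/1"},
    "B": {"chr1:100": "1/1", "chr1:200": "0/0", "chr2:50": "0/1", "chr3:10": "0/1"},
}
# ...and per-cluster genotypes from a mixed sample, each with one site missing.
clusters = {
    "0": {"chr1:100": "1/1", "chr1:200": "0/0", "chr3:10": "0/1"},
    "1": {"chr1:100": "0/0", "chr2:50": "1/1", "chr3:10": "0/1"},
}

assignment = assign_clusters(clusters, reference)
print(assignment)  # {'0': 'B', '1': 'A'}
```

With real data the calls would come from VCF files and would be noisier, so a low best score or a near tie between donors is worth flagging and checking by hand.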

This is how this was handled for this study. @RikLindeboom is going to get back to you with details.

lwlive commented 1 year ago

Hi ktpolanski! Thanks for your reply and patience! I think I understand now, thanks to your explanation. As I understand it, G1-PBMC, for example, is a mix of four donors: NP32, NP31, PC11 and AP7-post. NP32-NB_GEX, which comes from a nasal brushing, provides the known genotype information. That means I can separate the mixed PBMC sample using the nasal brushing data!

Thank you! Best regards! Wei Liu

ktpolanski commented 1 year ago

Yeah, the only element of uncertainty when talking with Rik was about the nasal data. In that case we're good and no need for Rik to pop down here :)

lwlive commented 1 year ago

Hi ktpolanski! Thanks for your help! You really did help me solve this problem. Thanks again!

RikLindeboom commented 1 year ago

Hi Wei,

Apologies for the late reply. I've been investigating if we can update the EGA submission with genotypes. You're right that for some samples we have matched nasal samples, and in these cases you can try to match those genotypes with souporcell cluster genotypes. Unfortunately, we don't have matched nasal / tracheal samples for all PBMCs, so this won't work for all samples.

While we figure out how to share the genotype vcfs that we have generated, I think the easiest solution for you would be to do a semi-supervised souporcell analysis. After you have rerun souporcell, you can compare the output with the ID labels that we provide in the freely available h5ad file on our data portal ( https://www.covid19cellatlas.org/ ). This should match quite nicely and you can then just assign the best overlapping id to each souporcell cluster.
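
One way to do the overlap step described above is a simple majority vote per cluster. The sketch below assumes you have already built two barcode-keyed dictionaries, for example from souporcell's cluster assignments and from the `obs` table of the h5ad file. The name of the column that holds the donor IDs is not given in this thread, and the barcodes here are made-up placeholders.

```python
from collections import Counter, defaultdict


def match_clusters_to_labels(cluster_by_barcode, label_by_barcode):
    """For each cluster, return the most common published label and its share of barcodes."""
    counts = defaultdict(Counter)
    for barcode, cluster in cluster_by_barcode.items():
        label = label_by_barcode.get(barcode)
        if label is not None:  # skip barcodes absent from the published object
            counts[cluster][label] += 1
    result = {}
    for cluster, counter in counts.items():
        label, hits = counter.most_common(1)[0]
        result[cluster] = (label, hits / sum(counter.values()))
    return result


# Hypothetical barcodes. The donor IDs are two of those named for G1-PBMC in this thread.
cluster_by_barcode = {
    "AAAC-1": "0", "AAAG-1": "0", "AACC-1": "0",
    "CCGT-1": "1", "CGTA-1": "1", "GGTA-1": "1",
}
label_by_barcode = {
    "AAAC-1": "NP32", "AAAG-1": "NP32", "AACC-1": "NP31",
    "CCGT-1": "NP31", "CGTA-1": "NP31",
}

matches = match_clusters_to_labels(cluster_by_barcode, label_by_barcode)
print(matches)
```

The second element of each tuple is the fraction of labelled barcodes in the cluster that carry the winning label. A low fraction suggests that the cluster does not correspond cleanly to a single donor.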

Btw, when running souporcell please note that it doesn't always work perfectly, and you might have to tinker around with parameters to get a good deconvolution of the genotypes. We found that in some cases it's required to run souporcell with one more cluster than expected, as it appeared that sometimes noise within one sample 'overshadowed' other real genotypes.
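
To compare a run with the expected number of donors against a run with one extra cluster, the two command lines can be generated side by side. This is a sketch under assumptions: the flags follow the usage shown in the souporcell README as far as I know (please confirm with `souporcell_pipeline.py --help`), and the file names, output naming and thread count are placeholders.

```python
def souporcell_commands(bam, barcodes, fasta, expected_donors, out_prefix, threads=8):
    """Build souporcell commands for k and k + 1 clusters."""
    commands = []
    for k in (expected_donors, expected_donors + 1):
        commands.append([
            "souporcell_pipeline.py",
            "-i", bam,                    # aligned BAM for the pooled sample
            "-b", barcodes,               # cell barcodes to consider
            "-f", fasta,                  # reference genome the BAM was aligned to
            "-t", str(threads),
            "-o", f"{out_prefix}_k{k}",   # hypothetical output naming scheme
            "-k", str(k),
        ])
    return commands


# Placeholder inputs for a pool of four donors, such as G1-PBMC.
cmds = souporcell_commands("possorted_genome_bam.bam", "barcodes.tsv", "genome.fa", 4, "G1-PBMC")
for cmd in cmds:
    print(" ".join(cmd))
```

Keeping both output directories makes it possible to check whether the extra cluster has absorbed noise or has split a real donor.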

Hope this helps for now, and I'll get in touch once our legal and data wrangling teams have advised on sharing the genotypes.

Many thanks and with best wishes, Rik

lwlive commented 1 year ago

Hi Rik G. H. Lindeboom, thanks for your time! I will follow your suggestion and keep an eye on the number of clusters in souporcell. Thanks again!

Yours, Wei Liu

nhidver commented 7 months ago

Hi Rik @RikLindeboom,

Hopping on to this thread because we are in a similar situation to Wei above: we are interested only in the GEX and V(D)J data of healthy PBMCs, which were pooled samples. I think we can run the semi-supervised souporcell analysis with your pointers. However, I wanted to ask whether you are now allowed to share the genotype VCFs that you generated.

Many thanks!

RikLindeboom commented 6 months ago

Hi all,

Thanks for bringing this up and bearing with us. This is just to confirm that we have now sent a VCF for deconvolution to EGA, so it should be available through EGA soon.

With best wishes, Rik