Closed gitUser128954 closed 1 year ago
Hi,
I do have a subcommand for scsnvmisc that converts the annotated pileup h5 file to reference and alternative counts in Matrix Market format, plus a VCF file. I don't use R very much, but I am sure there are libraries that can parse the VCF and Matrix Market files.
The pileup annotate tool also writes an annotated text file with information about each SNV (pileup_passed_snvs.txt.gz). The conversion subcommand does require a tab-separated chromosome lengths text file for the VCF header. You can generate this file from a samtools faidx indexed file:
samtools faidx genome.fa
cut -f 1,2,3 genome.fa.fai > chrom_lengths.txt
For example, this will write all sites that do not overlap annotated RNA edits:
scsnvmisc snv2vcfmtx -r chrom_lengths.txt -f genome.fa -o output_folder -e -c pileup_annotated.h5
This will produce:
output_folder/barcodes.txt # list of barcodes
output_folder/snvs.vcf # basic SNV VCF file
output_folder/refs.mtx # reference count Matrix Market file
output_folder/alts.mtx # alternative count Matrix Market file
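If it helps as a starting point, here is a minimal sketch of reading the two count matrices outside of R and computing an alternative allele fraction per entry. It uses only the Python standard library and assumes the files use the sparse "coordinate" Matrix Market layout. The function names (read_mtx, alt_fraction) and the toy matrices are made up for illustration, and I am not assuming anything about whether rows are SNVs or barcodes, so check the dimensions against barcodes.txt and snvs.vcf.

```python
import io


def read_mtx(handle):
    """Parse a Matrix Market coordinate file into (n_rows, n_cols, entries).

    entries maps 0-based (row, col) tuples to numeric values.
    """
    lines = iter(handle)
    header = next(lines, "")
    if not header.startswith("%%MatrixMarket"):
        raise ValueError("not a Matrix Market file")
    if "coordinate" not in header.lower():
        raise ValueError("only the sparse coordinate layout is handled here")

    # Skip comment and blank lines until the size line is reached.
    size = None
    for line in lines:
        if line.startswith("%") or not line.strip():
            continue
        size = line.split()
        break
    if size is None or len(size) != 3:
        raise ValueError("missing or malformed size line")
    n_rows, n_cols, n_entries = (int(x) for x in size)

    entries = {}
    for line in lines:
        if not line.strip():
            continue
        i, j, value = line.split()
        # Matrix Market indices are 1-based; store them 0-based.
        entries[(int(i) - 1, int(j) - 1)] = float(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


def alt_fraction(refs, alts):
    """Return alt / (ref + alt) for every position with at least one read."""
    fractions = {}
    for key in set(refs) | set(alts):
        ref_count = refs.get(key, 0.0)
        alt_count = alts.get(key, 0.0)
        total = ref_count + alt_count
        if total > 0:
            fractions[key] = alt_count / total
    return fractions


# Toy matrices (invented values, not real output from the tool).
REFS_TOY = """%%MatrixMarket matrix coordinate integer general
% toy reference counts
2 3 3
1 1 3
1 3 2
2 2 5
"""
ALTS_TOY = """%%MatrixMarket matrix coordinate integer general
2 3 2
1 1 1
2 3 4
"""

ref_rows, ref_cols, ref_counts = read_mtx(io.StringIO(REFS_TOY))
alt_rows, alt_cols, alt_counts = read_mtx(io.StringIO(ALTS_TOY))
fractions = alt_fraction(ref_counts, alt_counts)
```

For the real output you would pass open("output_folder/refs.mtx") and open("output_folder/alts.mtx") instead of the toy strings. If you would rather not hand-roll the parser, scipy.io.mmread in Python and Matrix::readMM in R both read this format directly.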
Unfortunately, I have not done much work clustering mutations.
I think this is a great implementation of a fundamental concept with clear utility. But, how do I read the data from a "serialized python flammkuchen" file into R? Is there a clean handoff from a flammkuchen file to downstream analysis pipeline(s)? Any recommendations on how to identify cell clusters from the mutation and expression data (via R or ‘not R’)?