Empty extracted .rc file

physnano commented 1 year ago

Hi Danny,

I have used rf-genome-count to generate an .rc file (which seems to have been successful as the .rc file is fairly large ~25Gb).

However when I try to extract a particular region with either a BED file or a GTF the extracted .rc file appears empty. rf-rctools extract -a rna_targets.bed BC04_align-sorted.plus.rc -o int_anno.plus.rc -ow

This is further confirmed with the (empty) output from rf-rctools view int_anno.plus.rc > int_anno.txt.

The BED file I am using is simply: chr6 52995621 52995948

Is more information required in the BED file?

I definitely have reads in this region as I can confirm with visual inspection in IGV.

Best,

Will

dincarnato commented 1 year ago

Hi Will,

To understand the issue, it would help me if you could share the bam file, the genome fasta file, the rf-count-genome command you ran, and the output rc file(s). If you can create a minimal case (for example by extracting a small region of the bam file and by processing it with the same rf-count-genome command), that's also ok.

One question in the meantime: when you say "empty" you mean literally empty? Or that the whole region has all the counts at 0? If the former, just share with me the output rc files and i'll see what's going on.

Thanks! Danny

physnano commented 1 year ago

rf-count-genome:

    --processors 12 \
    --count-mutations \
    --fasta ${refPath} \
    --primary-only \
    --sorted \
    --overwrite \
    --min-quality 4 \
    --median-quality 4 \
    --count-mutations \
    --only-mut "T2C" \
    --output-dir ${outDir} \
    ${inPath}

The genome reference is GRCh38.p13.genome.fa (https://www.gencodegenes.org/human/release_38.html)

I have shared a minimal .bam/.bami dataset directly. Re-running rf-count-genome on this minimal example seems to produce an .rc file of identical size suggesting the issue may actually be with rf-count-genome or perhaps some my passed options? This is nanopore data so suggestions as to the --min-quality and --median-quality parameters are greatly appreciated.

physnano commented 1 year ago

Minimal .bam files for Issue #41

On Sat, Aug 26, 2023 at 7:54 PM Danny Incarnato @.***> wrote:

Hi Will,

To understand the issue, it would help me if you could share the bam file, the genome fasta file, the rf-count-genome command you ran, and the output rc file(s). If you can create a minimal case (for example by extracting a small region of the bam file and by processing it with the same rf-count-genome command), that's also ok.

One question in the meantime: when you say "empty" you mean literally empty? Or that the whole region has all the counts at 0? If the former, just share with me the output rc files and i'll see what's going on.

Thanks! Danny

— Reply to this email directly, view it on GitHub https://github.com/dincarnato/RNAFramework/issues/41#issuecomment-1694555094, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACYFWLHGFYGYZFXB7MZJ4IDXXKZFHANCNFSM6AAAAAA374Y3FQ . You are receiving this because you authored the thread.Message ID: @.***>

dincarnato commented 1 year ago

Hi Will,

sorry I cannot see any attachment. Can you share the bam file in some other way, maybe via wetransfer.com? Concerning the size of the RC file, it has nothing to do with the size of the bam, but it has to do with the size of the genome.

Also, you didn't address my question: when you say "empty" you mean literally empty? Or that the whole region has all the counts at 0?

physnano commented 1 year ago

Sorry, I resent a minimal BAM file via google drive, let me know if you can see it. As for the extracted RC file I think its literally empty (I cannot even extract the region or see zeros)

dincarnato commented 1 year ago

Hi Will,

everything seems to work for me. See below:

[danny@rna-compute nanopore]$ rf-rctools extract -a file.bed BC04_align-sorted_7sk-ds.plus.rc -o int_anno.plus.rc -ow
[danny@rna-compute nanopore]$ rf-rctools view int_anno.plus.rc
chr6_52995621-52995948
ATGTGAGGGCGATCTGGCTGCGACATCTGTCACCCCATTGATCGCCAGGGTTGATTCGGCTGATCTGGCTGGCTAGGCGGGTGTCCCCTTCCTCCCTCACCGCTCCATGTGCGTCCCTCCCGAAGCTGCGCGCTCGGTCGAAGAGGACGACCATCCCCGATAGAGGAGGACCGGTCTTCGGTCAAGGGTATACGAGTAGCTGCGCTCCCCTGCTAGAACCTCCAAACAAGCTCTCAAGGTCCATTTGTAGGAGAACGTAGGGTAGTCAAGCTTCCAAGACTCCAGACACATCCAAATGAGGCGCTGCATGTGGCAGTCTGCCTTTCTT
0,102,20,106,32,106,51,25,67,101,50,50,56,48,161,216,89,131,115,162,153,67,92,63,62,71,101,230,93,72,38,123,75,75,111,96,103,295,86,40,74,82,44,94,135,182,253,98,80,83,501,206,39,75,138,288,229,140,93,110,203,44,53,50,31,94,56,65,168,96,79,95,117,247,219,176,107,137,706,137,134,144,86,91,63,87,268,61,78,103,30,42,99,67,42,23,56,11,72,56,303,112,39,76,28,45,99,151,62,141,124,91,57,54,36,35,32,50,28,59,441,52,79,89,49,71,106,77,78,47,53,49,34,205,168,119,111,388,66,65,87,89,30,124,54,38,56,74,68,97,83,66,74,149,33,40,242,579,86,120,115,114,44,115,52,34,47,45,46,40,69,148,110,83,146,54,168,101,81,335,79,66,39,147,258,178,170,81,260,99,181,193,151,103,89,51,754,212,69,109,178,97,158,101,69,112,76,90,312,98,191,163,135,183,151,43,64,88,37,35,67,40,67,54,92,106,54,85,84,72,73,84,71,122,46,149,140,68,71,85,35,37,69,129,83,102,65,164,218,30,45,49,46,98,113,73,113,156,210,88,67,82,259,209,136,220,68,132,147,275,114,143,142,69,100,88,78,38,45,30,105,72,265,125,63,56,72,69,113,57,71,51,69,36,44,46,224,66,47,34,42,91,68,53,150,64,152,59,215,70,101,185,69,100,159,81,77,29,109,55,76,108,222,41,23,10,10,1
4611,4681,4822,5059,5181,5198,5294,5353,5366,5374,5427,5434,5440,5447,5456,5481,5488,5492,5493,5508,5518,5546,5548,5550,5557,5558,5559,5559,5565,5569,5569,5570,5572,5577,5578,5579,5580,5581,5593,5594,5594,5595,5597,5598,5598,5598,5600,5665,5669,5674,5674,5680,5689,5691,5691,5693,5702,5709,5711,5712,5714,5724,5724,5724,5727,5735,5750,5750,5750,5763,5833,5839,5839,5841,5846,5856,5858,5863,5891,5897,5897,5901,5904,5910,5912,5919,5927,5933,5934,5937,5941,5944,5945,5946,5949,5951,5952,5954,5954,5954,5956,5964,5966,5968,5970,5972,5979,5980,5984,5988,6005,6006,6009,6010,6012,6014,6015,6015,6016,6016,6021,6022,6023,6024,6031,6031,6031,6035,6037,6041,6041,6045,6045,6045,6046,6052,6052,6052,6056,6160,6161,6165,6169,6169,6174,6176,6176,6177,6181,6183,6184,6186,6188,6189,6191,6191,6191,6194,6198,6198,6199,6207,6219,6220,6250,6251,6251,6259,6262,6262,6262,6263,6271,6271,6271,6273,6273,6274,6277,6298,6298,6298,6298,6299,6299,6324,6326,6326,6326,6336,6336,6337,6339,6339,6338,6337,6338,6338,6339,6336,6337,6337,6337,6338,6337,6337,6338,6339,6339,6339,6339,6340,6340,6339,6341,6342,6342,6342,6342,6342,6342,6342,6345,6345,6345,6344,6345,6345,6345,6345,6344,6344,6345,6346,6347,6347,6347,6351,6351,6352,6353,6354,6354,6352,6353,6353,6352,6352,6352,6352,6352,6352,6352,6352,6352,6352,6352,6350,6350,6348,6347,6345,6344,6344,6342,6345,6345,6345,6345,6342,6342,6342,6338,6336,6335,6335,6335,6335,6335,6334,6333,6331,6331,6327,6326,6327,6327,6325,6324,6324,6323,6323,6321,6312,6311,6310,6306,6306,6306,6304,6300,6292,6279,6275,6268,6262,6253,6243,6229,6221,6199,6170,6140,6125,6116,6091,6064,6045,6026,6003,5972,5912,5884,5850,5711,5286,5157,4805

I suspect that the issue is in the formatting of your BED file. 2 things to check:

The fields are tab-delimited and not space-delimited
The file was not generated using a Windows/Mac text editor, or, in case it was, you used the dos2unix Linux utility to convert line endings

Please see attached a test BED file. Let me know if this works.

Best, Danny

test_file.bed.gz

physnano commented 1 year ago

Hi Danny,

Thanks, was definitely a BED file formatting issue. Question related to the output though: I ran rf-count-genome with the --only_mut "T2C" option, therefore shouldn't I only be seeing mutation counts at T/U locations within the int_anno.plus.rc file? Or am I misinterpreting the -count or view output?

Best,

Will

dincarnato commented 1 year ago

Yes Will, you are theoretically correct, however is your experiment stranded or not? I believe you are passing the bam files to rf-count-genome without specifying the strandedness. Is it fair to assume that the R1 will always have the same sequence as the RNA that originated it, and the R2 will always be the reverse complement? If that is the case, you must specify the strandedness of your file as second-strand. Please refer to the manual, and below:

String	Strandedness
:u or :unstranded	The information on the genomic strand that originated the transcript is not preserved
:f or :first or :first-strand	Reads align to the strand complementary to the one that originated the transcript
:s or :second or :second-strand	Reads align to the same strand that originated the transcript

So, given the command line you specified before, your bam file should be specified as:

rf-count-genome \
    --processors 12 \
    --count-mutations \
    --fasta ${refPath} \
    --primary-only \
    --sorted \
    --overwrite \
    --min-quality 4 \
    --median-quality 4 \
    --count-mutations \
    --only-mut "T2C" \
    --output-dir ${outDir} \
    BC04_align-sorted_7sk-ds.bam:second

Alternatively, if the R1 is always the reverse-complement of the RNA and the R2 is always the RNA strand, you should use the first-strand mode. If your libraries are unstranded, the only way to do what you aim for is to perform the analysis transcriptome-level, and then to use the regular rf-count.

Please let me know if this helps! All the best,

Danny

physnano commented 1 year ago

Thanks Danny for your help, this all makes sense now.

Best,

Will

dincarnato / RNAFramework

Empty extracted .rc file #41