GoekeLab / m6anet

Detection of m6A from direct RNA-Seq data
https://m6anet.readthedocs.io/
MIT License

Empty output after m6anet dataprep #113

Open Albertperlas opened 1 year ago

Albertperlas commented 1 year ago

I am trying to detect m6A modifications from direct RNA sequencing of the avian influenza virus. It seems I have been able to obtain the eventalign.txt file... but after running m6anet dataprep the output files are empty. Do you see any error in the eventalign.txt file?

[Screenshot 2023-05-09: first lines of eventalign.txt]
chrishendra93 commented 1 year ago

hi @Albertperlas, apologies for the late reply, I've been busy at work. Can I check with you on the size of eventalign.txt, the command you used to run nanopolish eventalign, and the first few lines of the eventalign.index file?

Albertperlas commented 1 year ago

hi @chrishendra93, don't worry at all about the late reply. The eventalign file is 1.2 GB. The command I used was this one:

nanopolish eventalign --reads ./concatenatedHAC.fastq --bam ./output.sorted.bam --genome ./reference_all_segments_chr.fasta --scale-events --signal-index --summary ./vRNAhac/sequencing_summary.txt --threads 50 > ./AIVeventalign.txt

And here you can see the first lines of eventalign.index file:

[Screenshot: first lines of eventalign.index]

Thank you for your help!

chrishendra93 commented 1 year ago

hi @Albertperlas, your eventalign.index looks fine, so I'm a bit perplexed by this error. Could it be that you don't have enough reads per position? Can you try setting the --min_segment_count argument for dataprep to 1 and see whether data.info and data.json are still empty?
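For example (paths are placeholders, flags as discussed in this thread):

m6anet dataprep --eventalign ./AIVeventalign.txt --out_dir ./ --n_processes 4 --min_segment_count 1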

Albertperlas commented 1 year ago

Hi @chrishendra93, sorry for my late reply. I have tried running dataprep with the --min_segment_count argument set to 1; however, the result is the same: data.info and data.json are empty. There is this warning message:

/home/haicu/albert.perlas/miniconda3/envs/mamba_m6anet/lib/python3.8/site-packages/m6anet/utils/dataprep_utils.py:245: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(eventalign_filepath, chunksize=chunk_size, sep='\t'):
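For reference, a minimal sketch (placeholder path) that reports which Python types pandas infers for column 0, to help locate the rows behind this warning (e.g. stray header lines or numeric-looking contig names):

import pandas as pd

# Scan the first column of eventalign.txt in chunks and collect the set of
# Python types pandas infers for its values.
types_seen = set()
for chunk in pd.read_csv("AIVeventalign.txt", sep="\t", usecols=[0],
                         chunksize=1_000_000):
    types_seen.update(chunk.iloc[:, 0].map(type).unique())
print(types_seen)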

baishengjun commented 1 year ago

Hi, I encountered the same error. I ran nanopolish with the additional parameter --samples.

Albertperlas commented 1 year ago

Hi,

Thanks for the suggestion. I have run nanopolish with the additional --samples parameter and used the output .txt for m6anet dataprep, but now this error appears and the job runs forever:

/home/haicu/albert.perlas/miniconda3/envs/mamba_m6anet/lib/python3.8/site-packages/m6anet/utils/dataprep_utils.py:245: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  for chunk in pd.read_csv(eventalign_filepath, chunksize=chunk_size, sep='\t'):
Traceback (most recent call last):
  File "pandas/_libs/lib.pyx", line 2280, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "100.896,114.227,125.122,125.265,124.835,123.258,138.023,107.203,97.1688,104.91,120.105,99.6057,127.415,113.511,98.0289"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/haicu/albert.perlas/miniconda3/envs/mamba_m6anet/bin/m6anet", line 10, in <module>
    sys.exit(main())
  File "/home/haicu/albert.perlas/miniconda3/envs/mamba_m6anet/lib/python3.8/site-packages/m6anet/__init__.py", line 30, in main
    args.func(args)
  File "/home/haicu/albert.perlas/miniconda3/envs/mamba_m6anet/lib/python3.8/site-packages/m6anet/scripts/dataprep.py", line 68, in main
    parallel_preprocess_tx(args.eventalign, args.out_dir, args.n_processes,
  File "/home/haicu/albert.perlas/miniconda3/envs/mamba_m6anet/lib/python3.8/site-packages/m6anet/utils/dataprep_utils.py", line 383, in parallel_preprocess_tx
    data = combine(events_str)
  File "/home/haicu/albert.perlas/miniconda3/envs/mamba_m6anet/lib/python3.8/site-packages/m6anet/utils/dataprep_utils.py", line 291, in combine
    eventalign_result.loc[:, 'length'] = pd.to_numeric(eventalign_result['end_idx']) - \
  File "/home/haicu/albert.perlas/miniconda3/envs/mamba_m6anet/lib/python3.8/site-packages/pandas/core/tools/numeric.py", line 217, in to_numeric
    values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload]  # noqa
  File "pandas/_libs/lib.pyx", line 2322, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "100.896,114.227,125.122,125.265,124.835,123.258,138.023,107.203,97.1688,104.91,120.105,99.6057,127.415,113.511,98.0289" at position 0

Thanks, Albert
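The failing string above is a comma-separated list of raw current samples: the --samples flag makes nanopolish append an extra samples column to eventalign.txt, while m6anet's combine() reads a fixed 15-column layout (see the excerpt quoted later in this thread), so the extra column presumably ends up in the end_idx field that to_numeric then fails on. A minimal sketch to strip that column again (placeholder paths; assumes the extra column is named samples in the header):

import csv
from itertools import chain

# Drop the extra 'samples' column added by nanopolish --samples, restoring
# the 15-column layout that m6anet's dataprep expects.
with open("AIVeventalign.txt", newline="") as fin, \
     open("AIVeventalign.nosamples.txt", "w", newline="") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t")
    header = next(reader)
    drop = header.index("samples") if "samples" in header else -1
    for row in chain([header], reader):
        writer.writerow([v for i, v in enumerate(row) if i != drop])

Simply re-running nanopolish eventalign without --samples (as in the command earlier in this thread) avoids the issue entirely.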

Albertperlas commented 1 year ago

> Hi @chrishendra93, sorry for my late reply. I have tried running dataprep with the --min_segment_count argument set to 1; however, the result is the same: data.info and data.json are empty. There is this warning message:
>
> /home/haicu/albert.perlas/miniconda3/envs/mamba_m6anet/lib/python3.8/site-packages/m6anet/utils/dataprep_utils.py:245: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

Hi, I have gone back to this point, fixed column 0 of the eventalign.txt file to avoid mixed types, and then ran dataprep again with --min_segment_count set to 1. Now I don't get any error, but the output files are still empty :(

chrishendra93 commented 1 year ago

hi @Albertperlas and @baishengjun, the warning should not affect the result, I think. This seems to be related to the nanopolish eventalign.txt file, but I am not sure what the exact cause is. I haven't tried running m6anet dataprep on genomic coordinates before, so this could be the reason, but I cannot be sure unless I try it myself. Is this a public dataset or your own dataset? Do you think you could map to the transcriptome instead before running m6anet dataprep again?

Thanks!

Albertperlas commented 1 year ago

hi @chrishendra93, I will describe all my steps; let's see if you find any errors:

I have sequenced a sample with a high concentration of avian influenza virus, a segmented, single-stranded RNA virus, by direct RNA-seq.

First, I basecalled my fast5 files with Guppy:

./ont-guppy/bin/guppy_basecaller \
    -i "$input_fast5_dir" \
    -r -s "$output_fastq_file" \
    -c rna_r9.4.1_70bps_hac.cfg \
    -x "cuda:0"

Then I concatenated my fastq files and mapped the concatenated file to the reference of the sequenced virus (a known reference from NCBI). As it is a segmented virus, I concatenated the references of all the segments into one fasta file, used here as the reference for mapping with minimap2:

minimap2 -ax splice -uf -k7 "$reference_file" "$output_file" > aln.sam

I converted the output sam file into a sorted bam file using samtools.
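For reference, that step is typically something like this (filenames are placeholders):

samtools sort -o output.sorted.bam aln.sam
samtools index output.sorted.bam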

Then I prepared the data using nanopolish. First, I used nanopolish index on the RNA fastq file:

nanopolish index -d /path/to/directory/with/raw/fast5 /path/to/fastq/file.fastq

Second, I ran nanopolish eventalign:

nanopolish eventalign --reads /path/to/fastq/file.fastq --bam /path/to/sorted/bam/file.bam --genome /path/to/reference/file/italy_ref_plot.fasta --scale-events --signal-index --summary /path/to/output/summary.txt --threads 50 > ../path/to/output/file/output/AIVeventalign.txt

Finally, I used m6anet dataprep:

m6anet dataprep --eventalign ./AIVeventalign.txt --out_dir ./ --n_processes 4

This generates all the output files without errors, but they are empty... Do you notice any step that I should change?

Thanks again!
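One quick sanity check on the eventalign output: m6anet's dataprep keeps only rows where reference_kmer equals model_kmer (see the combine() excerpt quoted later in this thread), so if nothing passes that filter, the output files come out empty. A minimal sketch (placeholder path):

import pandas as pd

# Count how many eventalign rows would survive m6anet's kmer filter.
n_total = n_kept = 0
for chunk in pd.read_csv("AIVeventalign.txt", sep="\t", chunksize=1_000_000,
                         dtype=str):
    n_total += len(chunk)
    n_kept += int((chunk["reference_kmer"] == chunk["model_kmer"]).sum())
print(f"{n_kept} of {n_total} rows pass the reference_kmer == model_kmer filter")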

Albertperlas commented 1 year ago

Hi @chrishendra93, I also share with you one of my fast5 files in case you want to test alternative options:

https://drive.google.com/file/d/1RrYReeM6AWNmPJLBx6LfntSkYNLhbm10/view?usp=sharing

Thanks again, Albert

chrishendra93 commented 1 year ago

hi @Albertperlas, apologies for the delay, I have been quite busy at work. Thanks for supplying the fast5 files. I would love to troubleshoot this, but to do so, can I trouble you to supply the bam files as well, so that I can run nanopolish and m6anet myself and try to reproduce your error?

Albertperlas commented 1 year ago

Hi @chrishendra93, here you can find all the fast5 files, the concatenated fastq file after basecalling (max_vRNA.fastq), the filtered fastq file after removing short reads (<50; filtered.fastq), and the sorted bam file (sorted.bam) after mapping against the reference (italy_ref_plot.fasta): https://drive.google.com/drive/folders/1YT6NRLepZ4ak2cO5dWK0SU9vr_zuEbAi?usp=sharing

Thanks again!

Albertperlas commented 1 year ago

Hi @chrishendra93, sorry to bother you again with this, but do you have any news on this issue? Thanks!

mumdark commented 1 year ago

> Hi @chrishendra93, sorry to bother you again with this, but do you have any news on this issue? Thanks!

I encountered the same issue: the output files are empty for some samples, but not for others.

gabbyglee commented 11 months ago

Hi @chrishendra93, I am having the same issue. It works fine with some of my datasets, but I got an empty data.json and data.log, and a data.info containing only the header, when I tried to run m6anet dataprep on another dataset. Please let me know if you have any suggestions! Thanks.

AlexFitzgeraldBryan commented 9 months ago

Hi @chrishendra93 ,

I'm going to preface this by saying I'm not a bioinformatician, in case what I have is stupid. I've been having the same issue, also working on negative-sense viral RNA... which I think might be an element of the problem. I figured the issue was likely something to do with the eventalign.txt file, as the test data was working fine, so I spent an afternoon trying to break the eventalign.txt file manually. When I combined parts of the test data with my own, the dataprep process terminated as soon as it hit my data. The only difference I could discern was that the test data had the same sequence in both the reference and model kmer columns, while my data contained a reverse-complement sequence in reference_kmer. This looks to be the same in your data, @Albertperlas, which makes sense if you are working on influenza genomes rather than mRNA.

When I filter my eventalign.txt files to exclude any rows where the reference and model kmers do not match, m6anet is able to carry out the dataprep and inference steps "normally", resulting in populated data.json and data.log files. However, this alone isn't really a fix, as you may filter out huge amounts of data.
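A sketch of that filter (placeholder paths; streams the file in chunks so large eventalign files fit in memory):

import pandas as pd

# Keep only rows where reference_kmer matches model_kmer, writing a reduced
# eventalign file that dataprep can process.
first = True
for chunk in pd.read_csv("AIVeventalign.txt", sep="\t", chunksize=1_000_000,
                         dtype=str):
    kept = chunk[chunk["reference_kmer"] == chunk["model_kmer"]]
    kept.to_csv("AIVeventalign.filtered.txt", sep="\t", index=False,
                header=first, mode="w" if first else "a")
    first = False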

My guess is that this has something to do with the way nanopolish handles read orientation: the negative-sense RNA genomes are having their reference_kmer reverse-complemented for alignment to the genome, something that works with mRNA but doesn't with negative-sense genomes.

I've been trying to work out why this might be happening and found the combine() function in the dataprep_utils.py file. I think line 279 (cond_successfully_eventaligned = eventalign_result['reference_kmer'] == eventalign_result['model_kmer']) checks whether the values in these two columns are equal and keeps only the rows where they are. I assume this is used to account for NNNNN model kmers?

One obvious and stupid way around this would be to filter for mismatched and NNNNN kmers and then replace what remains with the values from the model_kmer column (or vice versa, if that makes better biological sense). Given that we are working with negative-sense RNA, would one of these simple options be reasonable, or would it completely break the analysis by providing false kmers?

# Imports needed by this excerpt (present at the top of dataprep_utils.py):
import numpy as np
import pandas as pd
from io import StringIO

def combine(events_str: str) -> np.recarray:
    r'''
    Function to aggregate features from eventalign.txt on a single transcript id and extract mean current level, dwelling time, and current standard deviation from each position in each read
            Args:
                    events_str (str): String corresponding to portion within eventalign.txt to be processed that can be read as pd.DataFrame object

            Returns
                    np_events (np.recarray): A NumPy record array object that contains the extracted features
    '''
    f_string = StringIO(events_str)
    eventalign_result = pd.read_csv(f_string,delimiter='\t',
                                    names=['contig','position','reference_kmer','read_index','strand','event_index',
                                           'event_level_mean','event_stdv','event_length','model_kmer','model_mean',
                                           'model_stdv','standardized_level','start_idx','end_idx'])
    f_string.close()
    cond_successfully_eventaligned = eventalign_result['reference_kmer'] == eventalign_result['model_kmer']

    if cond_successfully_eventaligned.sum() != 0:

        eventalign_result = eventalign_result[cond_successfully_eventaligned]

        keys = ['read_index','contig','position','reference_kmer'] # for groupby
        eventalign_result.loc[:, 'length'] = pd.to_numeric(eventalign_result['end_idx']) - \
                pd.to_numeric(eventalign_result['start_idx'])
        eventalign_result.loc[:, 'sum_norm_mean'] = pd.to_numeric(eventalign_result['event_level_mean']) \
                * eventalign_result['length']
        eventalign_result.loc[:, 'sum_norm_std'] = pd.to_numeric(eventalign_result['event_stdv']) \
                * eventalign_result['length']
        eventalign_result.loc[:, 'sum_dwell_time'] = pd.to_numeric(eventalign_result['event_length']) \
                * eventalign_result['length']

        eventalign_result = eventalign_result.groupby(keys)
        sum_norm_mean = eventalign_result['sum_norm_mean'].sum()
        sum_norm_std = eventalign_result["sum_norm_std"].sum()
        sum_dwell_time = eventalign_result["sum_dwell_time"].sum()

        start_idx = eventalign_result['start_idx'].min()
        end_idx = eventalign_result['end_idx'].max()
        total_length = eventalign_result['length'].sum()

        eventalign_result = pd.concat([start_idx,end_idx],axis=1)
        eventalign_result['norm_mean'] = (sum_norm_mean/total_length).round(1)
        eventalign_result["norm_std"] = sum_norm_std / total_length
        eventalign_result["dwell_time"] = sum_dwell_time / total_length
        eventalign_result = eventalign_result.reset_index()

        eventalign_result['transcript_id'] = eventalign_result['contig']    #### CHANGE MADE ####
        eventalign_result['transcriptomic_position'] = \
                pd.to_numeric(eventalign_result['position']) + 2 # the middle position of 5-mers.
        features = ['transcript_id', 'read_index',
                    'transcriptomic_position', 'reference_kmer',
                    'norm_mean', 'norm_std', 'dwell_time']
        df_events = eventalign_result[features]
        np_events = np.rec.fromrecords(df_events, names=[*df_events])
        return np_events
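For reference, a minimal sketch (placeholder path; standard nanopolish column names as above) to quantify the reverse-complement hypothesis by counting matching rows, reverse-complement rows, and everything else:

import pandas as pd

# Count how many eventalign rows have reference_kmer equal to model_kmer,
# equal to its reverse complement (the negative-sense case), or neither
# (e.g. NNNNN model kmers).
COMPLEMENT = str.maketrans("ACGTU", "TGCAA")

def revcomp(kmer: str) -> str:
    return kmer.translate(COMPLEMENT)[::-1]

n_match = n_revcomp = n_other = 0
for chunk in pd.read_csv("AIVeventalign.txt", sep="\t", chunksize=1_000_000,
                         dtype=str):
    ref = chunk["reference_kmer"].fillna("")
    model = chunk["model_kmer"].fillna("")
    match = ref == model
    rc = ~match & (ref == model.map(revcomp))
    n_match += int(match.sum())
    n_revcomp += int(rc.sum())
    n_other += int((~match & ~rc).sum())
print(f"match: {n_match}, reverse complement: {n_revcomp}, other: {n_other}")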
joshtburdick commented 5 months ago

Disclaimer: I'm not sure if I was seeing the same issue that others were.

However, I was also getting an empty data.info file. This was after using f5c instead of nanopolish, and tweaking the format slightly to match the example event file in the source at m6anet/tests/data/eventalign.txt.

That example file was running fine. I eventually noticed that it was sorted by read_index; I had been sorting events by position.

When I sorted the event file by read_index, I got a non-empty data.info file, and a non-empty list of called modifications. Which seems promising, although I'm still working on checking the list of modifications...
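A minimal sketch of that re-sort (placeholder paths; it loads the whole file into memory, so use an external sort for multi-GB files):

import pandas as pd

# Re-sort an eventalign file by read_index so that all events from one read
# are contiguous, matching the ordering of the bundled example file.
df = pd.read_csv("eventalign.txt", sep="\t", dtype=str)
df["read_index"] = pd.to_numeric(df["read_index"])
df = df.sort_values("read_index", kind="stable")
df.to_csv("eventalign.sorted.txt", sep="\t", index=False)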