bcgsc / NanoSim

Nanopore sequence read simulator

It is hard to understand how to use pretrained models #78

Closed. artsiomkaltovich closed this issue 4 years ago

artsiomkaltovich commented 4 years ago

Hello.

I am new to the NanoSim tools, so sorry for a possibly stupid question, but I can't figure out how to use the pretrained models. What should I specify as the -rt argument?

Thank you.

cheny19 commented 4 years ago

If you have the pre-trained models at hand, you just need to run simulator.py and specify the prefix of the pre-trained model using the -c option.

artsiomkaltovich commented 4 years ago

Hello @cheny19

I've tried

bin/simulator.py transcriptome -e ~/projects/ngs-analysis/data/human_NA12878_dDNA_Bham1_guppy/expression_abundance.tsv -c human_NA12878_dRNA -o ~/projects/ngs-analysis/data/human-drna

It failed because the -rt parameter wasn't specified. What should be specified here? By the way, is the -e option necessary in this case?

cheny19 commented 4 years ago

-rt is the reference transcriptome that you want to simulate, and -e is the abundance profile, which is also included in the pre-trained model zip file. If you want to simulate a transcriptome with different expression levels, you can modify that file. But if you want to simulate a different species, you'll need to create your own abundance profile with the transcripts from that species. Please also refer to the README.md and the tool's help message for usage.

artsiomkaltovich commented 4 years ago

Hello.

It still isn't clear. Could you give the exact command one should use?

Should I specify both -rt and -rg with the same gff3 file from the pretrained model?

cheny19 commented 4 years ago

It depends on what transcriptome you want to simulate. If you want to simulate the same species, you can specify the same gff3 as in the pre-trained models. But -rt is used to specify transcriptome, not annotation file (gff3 file). A transcriptome should be a fasta file which you can download from Ensembl or RefSeq or UCSC, and there may be cdna or similar words in the name. We would suggest you download the latest version built on latest genome assembly, because these files are manually curated and updated regularly. If you still have trouble, you can show me the link you find, and we can double check for you.
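As a rough illustration of how the options mentioned so far fit together, the sketch below assembles a simulator.py call in Python. It uses only the options named in this thread (-rt, -rg, -e, -c, -o); all file paths and the helper name build_transcriptome_command are hypothetical placeholders, and whether every option is required should be checked against the tool's help message.

```python
import shlex


def build_transcriptome_command(ref_transcriptome, ref_genome, abundance,
                                model_prefix, out_prefix):
    """Assemble a simulator.py call from the options discussed in this thread.

    Hypothetical helper: it only builds the argument list, it does not run it.
    """
    return [
        "simulator.py", "transcriptome",
        "-rt", ref_transcriptome,  # transcriptome FASTA (cDNA), not a gff3/gtf file
        "-rg", ref_genome,         # reference genome FASTA
        "-e", abundance,           # abundance profile shipped with the model
        "-c", model_prefix,        # prefix of the pre-trained model files
        "-o", out_prefix,          # output prefix
    ]


# Placeholder paths, for illustration only.
cmd = build_transcriptome_command(
    "transcripts.fa",
    "genome.fa",
    "model_dir/expression_abundance.tsv",
    "model_dir/model_prefix",
    "out/simulated",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the placeholders are replaced with real paths.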

artsiomkaltovich commented 4 years ago

"We would suggest you download the latest version built on latest genome assembly"

OK, I thought the same reference version that was used for model training was required.

Thank you, I will try.

cheny19 commented 4 years ago

Not really, but your reference transcriptome (fasta) file has to match your annotation file (gtf) and abundance profile in the simulation stage. So if you use another version of the fasta file, you may need to adjust your abundance profile, i.e. the transcripts listed in it, so that NanoSim can find every corresponding transcript in the fasta file and simulate it.
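One way to check this before running a simulation is to list the transcripts in the abundance profile that have no entry in the FASTA file. The sketch below is a stand-alone pre-flight check, not NanoSim's own matching logic. It assumes the abundance profile is tab-separated with one header row and the transcript ID in the first column, and that a FASTA record's ID is the first whitespace-delimited token after ">". Pipe-delimited headers or version suffixes may need extra normalisation. The IDs in the example are made up.

```python
import csv


def fasta_ids(fasta_lines):
    """Collect record IDs: the first whitespace-delimited token after '>'."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def missing_transcripts(abundance_lines, fasta_lines):
    """Return abundance-profile IDs that do not appear in the FASTA."""
    known = fasta_ids(fasta_lines)
    reader = csv.reader(abundance_lines, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    return [row[0] for row in reader if row and row[0] not in known]


# Made-up in-memory data; with real files, pass open file handles instead.
fasta = [">TX0001.1 cdna", "ACGT", ">TX0002.1 cdna", "GGCC"]
abundance = ["transcript_id\tcount", "TX0001.1\t10", "TX0003.1\t2"]
print(missing_transcripts(abundance, fasta))  # ['TX0003.1']
```

Any ID reported here would need to be removed from the abundance profile, or the FASTA swapped for a version that contains it.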

artsiomkaltovich commented 4 years ago

Hello again)

So could you share a link where one can download the references for human_NA12878_dRNA_Bham1_guppy.tar.gz?

Also when I try to run the following command:

simulator.py transcriptome -rg GRCh38.primary_assembly.genome.fa -rt gencode.v32.transcripts.fa -e expression_abundance.tsv -o ~/project/isoquant/data/

It fails with:

Traceback (most recent call last):
  File "/home/akaltovich/miniconda3/envs/nanosim/bin/simulator.py", line 1513, in <module>
    main()
  File "/home/akaltovich/miniconda3/envs/nanosim/bin/simulator.py", line 1503, in main
    read_profile(ref_g, ref_t, number, model_prefix, perfect, args.mode, strandness, exp, model_ir, "linear")
  File "/home/akaltovich/miniconda3/envs/nanosim/bin/simulator.py", line 397, in read_profile
    with open(model_prefix + "_match_markov_model", 'r') as mm_profile:
FileNotFoundError: [Errno 2] No such file or directory: 'training_match_markov_model'

Is that file missing from the archive?

cheny19 commented 4 years ago

Since it is a human dataset, you can use the Ensembl FTP site to download the reference genome and transcriptome.

Your command fails because you did not specify the prefix of your pre-trained model. You need to download the models and, since they are gzipped tarballs, extract them, then use -c to specify the prefix of the pre-trained model (i.e. the common string shared among almost all the profile files).
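The traceback above shows simulator.py opening model_prefix + "_match_markov_model", so one way to find a usable -c value is to locate that file in the extracted model directory and strip the suffix. The sketch below does only that. The helper name find_model_prefixes is hypothetical, and the demo uses a temporary directory with an empty stand-in file rather than a real model.

```python
import tempfile
from pathlib import Path

# File name suffix taken from the traceback earlier in this thread.
SUFFIX = "_match_markov_model"


def find_model_prefixes(model_dir):
    """Return candidate -c prefixes found under model_dir.

    Each candidate is the path of a file ending in SUFFIX with SUFFIX removed.
    """
    matches = sorted(Path(model_dir).rglob("*" + SUFFIX))
    return [str(path)[: -len(SUFFIX)] for path in matches]


# Demo with a stand-in file; point this at the extracted archive in practice.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / ("training" + SUFFIX)).touch()
    print(find_model_prefixes(tmp))
```

If the list is empty, the archive was probably not extracted into that directory. If it has one entry, that string is the value to pass to -c.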