Closed yeroslaviz closed 1 year ago
Hm, looks like NCBI have broken their API to fetch sample data, will do a check and fix it
Fixed, run
devtools::install_github("Roleren/ORFik")
I had to dig into the new API, looks like the Trace DB is being deprecated, well well. Fixed now at least.
thanks,
now it works with all options (PRJ, GSE, SRP).
I'm not sure if I need to open a new issue for that, as this is just a question.
If I already have the fastq file, and I don't want to download them again ( or if they are my own fastq files without a metadata file from GSE - how can I start an experiment?
Do I just need to create the conf object with my needed paths and put the fastq files in the raw_data folder? Can I also create the metadata file manually? Do I need it later on for the analysis or is it mainly for the naming of tha samepls.
Another error I get now, not sure if it is again related to changes in the links from ensembl, is when I try to download the annotation using this command:
getGenomeAndAnnotation("Saccharomyces cerevisiae", tempdir(), assembly_type = "toplevel")
I run into this error:
Starting toplevel genome retrieval of Saccharomyces_cerevisiae from ensembl:
Found organism but given release number did not specify existing file
in ensembl, maybe it is too old? Check that it exists on ensembl
first at all.
Remember some small genome organisms like yeast, does not have primary assemblies, then change assembly_type to toplevel and/or usedb = refseq.
Error in file.exists(filename) : invalid 'file' argument
This also happens when trying to download 'Danio rerio' or even 'homo spaiens'.
First of, the ORFik experiment config is for much more than file naming, among other linking libraries to annotation, organism, automatic loading of files, etc But it is not required by ORFik, but makes things a lot simpler down the road.
If you have a folder with custom files, just run create.experiment on that folder, and add sample information, organism etc. You get a csv file you can edit if needed.
Then run the function list.experiments when you are done to see which experiments you have made.
Secondly, about getGenomeAndAnnotation, this is most likely a curl bug upstream of ORFik. Maybe in the biomartr package, I will check if I can push a fix for that.
Ah, yes, there was a bug I made a Pull request to biomartr a while back, it has still not propagated to CRAN, so you have to do: For me it works, since I use devel of biomartr (Nothing I can do for now, until it will push to CRAN in a few weeks)
For it to work for you for now, do:
devtools::install_github("ropensci/biomartr")
Then you get the version with the patch I made.
If you have a folder with custom files, just run create.experiment on that folder, and add sample information, organism etc. You get a csv file you can edit if needed.
I would really appreciate it, if you can send me a template file with the necessary columns, so that i can input my own data into for the analysis.
can I run the create.experiment on a folder with fastq files or should the files be already mapped (bam files)?
I guess I will than need to first align them using STAR.align.folder
.
Check how I did it in the Ribo-seq vignette. First map reads, the run create.experiment on the bam folder and insert valid types.
My guess is for you that libtype = "RFP" (that is riboseq), and condition = "WT" and some experimental condition ? And set rep = c(1,2,3,1,2,3) if you have a 2 by 3 design etc.
ORFik tries to auto-guess the types, so if you have names like Ribo_WT_rep1, then it will just figure it out. Let me know what it outputs, then I can help you if you think it is wrong :)
I would appreciate the help, as it seems that the mapping was really bad. I have managed to map only a very small fraction of the reads to the S. cerevisiae (< 3%). It is the correct organism, as I am using a public data set (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114882).
For testing reasons, I am using only four samples for the run to be done quicker. These are steps I have done so far (If you like, I can also share the Rmd File i have for this analysis with you):
conf
objectIt seems that mapping to the rRNA is not an issue, as less than 10% of the reads were mapped to the contaminant runs.
one factor might be the fact that when using the auto
option in the aligner step, fastp
doesn't recognise the presence of the adapter and therefore doesn't trim anything. I will change the setting to the adapter sequence ( standard TruSeq sequence - AGATCGGAAGAGC
). Or is this also a versioning thing? (I'm using 0.23.2).
The fastp version should not matter I think, I think my warning system of failed adapter detection is a bit weak, but when you know it, it works well. So what you do is set it to the TRuSeq adapter sequences, yes.
Then you mapping rate should increase considerably.
The reason this happens is this: 1.The adapters are not detected, therefor not trimmed 2.STAR in local alignment (the default) requires 50% of the read to map, but this will fail a lot for you, since adapters are often ~ 50% of the read. 3.Your result will be a very poor mapping rate.
well, unfortunately not. I have tested two things now.
cutadapt
to trim the rads using the same adapter seq and then run the same command, but only with the steps = "co-ge"
parameter.While the run with cutadapt mapped still around 3%, the one with fastp and the adapter sequence given have increased the results to only around 7%.
So there must be something other than that wrong in the data. I am trying to see if the data contains any contamination, or maybe even a different organism.
So, as you can see in the image below, mapping against S. cerevisiae is correct. But I guess, the reason for the low mapping results is the very high number of duplications.
The question is, does it make sense to increase the number of allowed multi-mapped to increase the mapping results?
There the alignment is 50%, so that is great actually.
Where did you get the 3% and 7% values earlier ? Did you collapse reads before you aligned ?
Yes, this is what makes me wonder as well.
The image above was made with fastq-screen (do you know it?). The mapping was done on the raw fastq files with no trimming or filtering and the mapping is done with bowtie2
.
The 3% and 7% I got from the normal runs, the first with the adapter set to "auto" and the second when the adapter sequence was given in full.
Increasing the multi-mapped reads to 30 didn't increase the results significantly as expected.
Not really sure what to do next.
Just a thought. Would it maybe make sense to try and map the samples against the transcriptome and not genome?
Hm, I think something is off in the call to STAR, some argument that should not be there is included. Bowtie2 should never outperform STAR on spliced RNA data (like Ribo-seq). Could you send me the console output when you run ? (just rerun and abort after 2 sec, when all init console output is there, so I can verify all arguments)
Here is the command I'm using to execute the mapping
STAR.align.folder(input.dir = conf["fastq Ribo-seq"], output.dir = paste0(conf["bam Ribo-seq"],"AGATCGGAAGAGC_trim_multiMapped100/"), index.dir = index,
star.path = "/fs/home/yeroslaviz/miniconda3/envs/Mapping/bin/STAR",
fastp = "/fs/home/yeroslaviz/miniconda3/envs/Mapping/bin/fastp",
paired.end = FALSE,
steps = "tr-co-ge", # (trim needed: adapters found, then genome)
keep.contaminants = FALSE,
max.cpus = 15, # number of threads to use
adapter.sequence = "AGATCGGAAGAGC", # Adapters are auto detected
trim.front = 0, min.length = 20,
max.multimap = 100)
the complete output of this command is also attached here. STAR.align.folder_Output.txt
I played with the STAR parameters externally to see if I can increase the #mapped reads and I have (for now) ended up with this command
STAR --runThreadN 12 --genomeDir STARindex --readFilesIn fastq.gz
--outFileNamePrefix file. --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 7062663806
--winAnchorMultimapNmax 400 --outFilterMultimapNmax 200
--seedSearchStartLmax 10 --outFilterMismatchNoverLmax 0.05
--outFilterMatchNmin 0 --outFilterScoreMinOverLread 0
--outFilterMatchNminOverLread 0 --alignIntronMax 1
The command takes as input the fastq file after removing the adapter with cutadapt and filtering out reads which mapped to rRNA (this mapping is done using bowtie).
the results of the mapping is this:
Started job on | Jan 20 10:28:35
Started mapping on | Jan 20 10:28:37
Finished on | Jan 20 10:45:59
Mapping speed, Million of reads per hour | 94.67
Number of input reads | 27402469
Average input read length | 40
UNIQUE READS:
Uniquely mapped reads number | 12520204
Uniquely mapped reads % | 45.69%
Average mapped length | 16.40
Number of splices: Total | 0
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 0
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 0.18%
Deletion rate per base | 0.00%
Deletion average length | 1.00
Insertion rate per base | 0.00%
Insertion average length | 1.21
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 14873457
% of reads mapped to multiple loci | 54.28%
Number of reads mapped to too many loci | 0
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 8797
% of reads unmapped: too many mismatches | 0.03%
Number of reads unmapped: too short | 0
% of reads unmapped: too short | 0.00%
Number of reads unmapped: other | 11
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
This mapping was done against a specifically made transcriptome for the RiboSeq analysis with bases upstream of the TSS added to each of the sequences. so It cannot be used automatically with ORFik.
How can I add these parameters to the ORFik script to be run within the package?
thanks
Could you rerun the same input file after rRNA removal with ORFik default Star version 2.7.4a. and the internal orfik script (not your version) ? Need to know if it is the version of star or arguments where something is off.
STAR only takes an index during mapping, give the transciptome index instead and it should work in ORFik.
Transciptome mapping should never align better than genome mapping, so something is going on here, something strange..
well, this is embarrassing.
I have solved the problem, nothing technical or complicated. The biologist gave me the wrong adapter. We are working on two data sets in parallel one for S. cerevisiae and one for C. elegans.
Apparently I have used the wrong adapter here. Instead of AGATCGGAAGAGC
I needed to use CTGTAGGCACCATCAAT
.
Now I get these results:
Final statistics:
sample sample_id total mapped reads %-genome total mapped reads #-genome ...
1: GSM3152885 1 80.98 10820008 ...
2: SRR7214394 2 81.98 3812062 ...
3: SRR7214409 3 85.82 11535207 ...
4: SRR7214411 4 85.70 12337176 ...
I'm truly sorry for wasting your time for such trivialities. At least i have learned a lot in the process about your tool as well as how STAR works nicely for very short reads.
I'm closing this issue for now, but I am sure I'll contact you again downstream in the analysis.
thanks for all the help
Assa
Haha, you are not the first to do this, I have done this myself before.
The really cool thing is if you had used massiveNGSpipe (the package we will make public soon), it catches this by default. Since it can even find correct adapter in a multiadapter environment. Will let you know when we release it in a few days.
We might even include a shiny GUI so you can keep track of progress of processing (we could in theory connect it to discord etc, so you could get a status plot or a ping when something is done, a lot of cool possibilities).
This sounds cool. Would love to test it. let me know
When trying to download the metadata from an bioproject, I get this error:
I have tried different values:
only SRP148891 works correctly.
I have installed the development version using
devtools
.any ideas, if I need to change something?
Assa