Closed brettvanderwerff closed 5 years ago
Hi Brett!
Right off the bat, I can see that the transcript IDs in the data don't match the IDs in the annotation. RATs should have given you a warning or error about that? Did you force it through?
Your myannot looks perfect with just one code in each of the two columns in each row.
Your quantifications, on the other hand, have target_id values with the transcript ID embedded in a larger string.
As stated in its own section of the input vignette, the set of transcript IDs in the annotation look-up table must match exactly the set of transcript IDs in the data, so that RATs knows how to group transcripts. If the set of IDs is different between the two, RATs can't match the data to the annotation. So you get 0 counts for all the IDs in the annotation, while IDs in the data not found in the annotation lack grouping information and are simply ignored and lost.
The values in target_id are taken verbatim. RATs does not assume any specific formatting conventions and does not try to break down multi-field info, in order to safeguard against accidents. It is your responsibility as a user to provide a look-up table suitable for your data.
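To make the exact-match requirement concrete, here is a minimal sketch of the kind of check a user could run before handing data to RATs. It is not RATs' own ID check; `compare_id_sets` is a hypothetical helper, and the two example IDs are illustrative values taken from this thread.

```python
def compare_id_sets(annot_ids, data_ids):
    """Report which transcript IDs appear on only one side of the comparison."""
    annot, data = set(annot_ids), set(data_ids)
    return {
        "shared": annot & data,
        "only_in_annotation": annot - data,  # these would end up with 0 counts
        "only_in_data": data - annot,        # these have no grouping information
    }


# Illustrative values: a clean ID in the annotation, a composite ID in the data.
annot_ids = ["ENST00000000233.9"]
data_ids = [
    "ENST00000000233.9|ENSG00000004059.10|OTTHUMG00000023246.6|"
    "OTTHUMT00000059567.2|ARF5-201|ARF5|1103|protein_coding|"
]

report = compare_id_sets(annot_ids, data_ids)
print(len(report["shared"]), "shared IDs")
```

Because the comparison is verbatim, the two strings above count as different IDs even though one contains the other.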
You'll have to either clean up the IDs in the data (before importing it, otherwise it seems fish4rodents() makes a bit of a mess trying to shoehorn it onto the annotation, from what I see...), or create an annotation look-up that uses the same long composite strings as target_ids that your data does.
I hope this helps, good luck!
Kimon
Hi Brett!
Right off the bat, I can see that the transcript IDs in the data don't match the IDs in the annotation. RATs should have given you a warning or error about that? Did you force it through?
No warning, I did not force it through
Your myannot looks perfect with just one code in each of the two columns in each row. Your quantifications, on the other hand, have target_id values with multiple codes and info separated by pipes.
Yes, I thought that was suspicious too, but I guess not suspicious enough.
As stated in its own section of the input vignette, the set of transcript IDs in the annotation look-up table must match exactly the set of transcript IDs in the data, so that RATs knows how to group transcripts. If the set of IDs is different between the two, RATs can't match the data to the annotation. So you get 0 counts for all the IDs in the annotation, while IDs in the data not found in the annotation lack grouping information and are simply ignored and lost.
Ok great! Sorry I did not see that section and issue #40 until after posting. Really sorry about that.
You'll have to either clean up the IDs in the data (before importing it, otherwise it seems fish4rodents() makes a bit of a mess trying to shoehorn it onto the annotation, from what I see...), or create an annotation look-up that uses the same long composite strings as target_ids that your data does.
So you are talking about parsing the abundance.tsv file and converting, e.g.:
ENST00000000233.9|ENSG00000004059.10|OTTHUMG00000023246.6|OTTHUMT00000059567.2|ARF5-201|ARF5|1103|protein_coding|
to just include the ENST id:
ENST00000000233.9
for every row? If so, that is a pretty straightforward fix.
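The string transformation described here can be sketched as follows. `first_field` is a hypothetical helper name, and the sketch only shows the ID conversion itself, not which kallisto output file it should be applied to.

```python
def first_field(target_id, sep="|"):
    """Return the part of a composite ID before the first separator."""
    return target_id.split(sep, 1)[0]


composite = (
    "ENST00000000233.9|ENSG00000004059.10|OTTHUMG00000023246.6|"
    "OTTHUMT00000059567.2|ARF5-201|ARF5|1103|protein_coding|"
)
print(first_field(composite))
```

An ID that contains no pipe is returned unchanged, so the function is safe to apply to IDs that are already clean.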
I hope this helps, good luck!
Thank you, I really appreciate the feedback and help
Yep, that should do it. :)
I am, however, concerned that you did not get a warning about this. RATs should have simply not gone through with the run at all. Maybe the ID check is not as thorough as I thought it was.
Ok I will give this a shot tonight. You can probably close this issue, but if you want I can come back with a confirmation that this was the solution once I get a chance to try this out.
On second thought, no, it is a bit more complicated than that. You'll have to go into the abundance.h5, not the .tsv. The tsv does not contain the bootstraps, only the averages.
For the amount of time it could take to figure out the h5 format it may be simpler to re-run kallisto instead, ensuring the IDs in the gtf are clean from the start. Or to create the lookup table from the same gtf used for kallisto, keeping the long format throughout.
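One way to get clean IDs "from the start" is to shorten the sequence names before building the kallisto index. The sketch below assumes the composite IDs come from the header lines of the gzipped transcript FASTA passed to kallisto index; `clean_header` and `clean_fasta` are hypothetical helpers, and any input and output paths are placeholders to be chosen by the user.

```python
import gzip


def clean_header(line, sep="|"):
    """Trim a FASTA header line to its first field; leave sequence lines as they are."""
    if line.startswith(">"):
        return line.rstrip("\n").split(sep, 1)[0] + "\n"
    return line


def clean_fasta(in_path, out_path):
    """Copy a gzipped FASTA file, shortening every header line on the way."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            dst.write(clean_header(line))


header = (
    ">ENST00000000233.9|ENSG00000004059.10|OTTHUMG00000023246.6|"
    "OTTHUMT00000059567.2|ARF5-201|ARF5|1103|protein_coding|\n"
)
print(clean_header(header), end="")
```

The index would then be rebuilt from the cleaned file and the quantification re-run, so that both the .tsv and the .h5 outputs carry the short IDs.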
On 14 Feb 2019 00:32, Brett Vanderwerff notifications@github.com wrote:
Ok I will give this a shot tonight. You can probably close this issue, but if you want I can come back with a confirmation that this was the solution once I get a chance to try this out.
I ended up just changing the h5 files. It wasn't too bad and I learned more about the format. I was able to generate some plots after that with RATS and things look good so far. Thank you again. I am working with a gene that has many different isoforms, so this tool is very interesting to me. Thank you for making RATS and maintaining it. Go ahead and close this if you want.
Great! Glad it is working for you now. Thank you for using RATs.
Hi,
I am having some issues, but would like to try and give as much information as possible about my workflow.
I am doing pseudoalignment with kallisto. I get the transcriptome from GENCODE by following the "transcript sequences" (CHR regions) link under the FASTA files heading at https://www.gencodegenes.org/human/:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.transcripts.fa.gz
I build the rna index for kallisto like so:
./kallisto/kallisto index -i ./transcriptome/rna_index_gencode ./transcriptome/gencode.v29.transcripts.fa.gz
I then run kallisto to do pseudoalignment on paired-end fastq files, similar to the code shown below:
The files are actually a subset from this dataset: https://www.ebi.ac.uk/ena/data/view/PRJNA347513
I then run a little script with RATS to try things out. I use an annotation file from GENCODE by following the "Comprehensive gene annotation" (CHR regions) link under the GTF/GFF3 heading at https://www.gencodegenes.org/human/:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz
and run the script:
The strange thing is that after running print( dtu_summary(mydtu) ) this is what I get:
It seems strange that all of the transcripts/genes are ineligible. I'm not sure if this is a bug or just a negative result with my data.
this is what myannot looks like:
this is what mydtu looks like:
this is what mydata$boot_data_A looks like:
mydata$boot_data_B:
session info:
I have also tried running kallisto using Ensembl's cDNA file:
ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
then using either of these annotation files, but got a similar result:
ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.chr_patch_hapl_scaff.gtf.gz
ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.gtf.gz
Sorry to bother you, and apologies if I have missed something obvious, but I am pretty interested in your method and would really like to see this work.