BUStools / bustools

Tools for working with BUS files
https://bustools.github.io/
BSD 2-Clause "Simplified" License
89 stars 22 forks source link

nuclear RNA #11

Open lpantano opened 5 years ago

lpantano commented 5 years ago

Hi,

I am trying the tool for the quantification of nuclear RNA, after my recent tweets with Lior Patcher.

Every commands in the Velocity tutorial seems to work, but I want only the quantification by gene taking into account reads in introns and exons, so I am really don't want the outputs in the velocity tutorial.

I imagine that something like this is the solution after doing correct and sort:

bustools count -o counts -g t2g.txt -e matrix.ec -t transcripts.txt --genecounts output.sort.bus

but I want to make sure that t2g.txt is the correct. I noticed the annotation for introns is TRANSCRIPT.N-I meanwhile the exons is TRANSCRIPT.N. If the t2g.txt file is lie this:

ENST00000456328.1       ENSG00000223972 DDX11L1
ENST00000450305.2       ENSG00000223972 DDX11L1
ENST00000488147.3       ENSG00000227232 WASH7P

It would recognized introns and exons, or I need to add the intron annotation to the file?

Thanks!

Munfred commented 5 years ago

The cDNA_introns.t2g.txt file provided in the tutorial does have both intons and exons in the same txt file. You can take a look at the beginning and end of the file by doing:

$ wget https://github.com/BUStools/getting_started/releases/download/velocity_tutorial/cDNA_introns.t2g.txt.gz

$ gunzip cDNA_introns.t2g.txt.gz

$ head cDNA_introns.t2g.txt
ENST00000456328.1       ENSG00000223972 DDX11L1
ENST00000450305.2       ENSG00000223972 DDX11L1
ENST00000488147.3       ENSG00000227232 WASH7P
ENST00000619216.4       ENSG00000278267 MIR6859-1
ENST00000473358.5       ENSG00000243485 MIR1302-2HG
ENST00000469289.6       ENSG00000243485 MIR1302-2HG
ENST00000607096.7       ENSG00000284332 MIR1302-2
ENST00000417324.8       ENSG00000237613 FAM138A
ENST00000461467.9       ENSG00000237613 FAM138A
ENST00000606857.10      ENSG00000268020 OR4G4P

$ tail cDNA_introns.t2g.txt
ENST00000463325.1056070-I       ENSG00000184319 RPL23AP82
ENST00000463325.1056071-I       ENSG00000184319 RPL23AP82
ENST00000494075.1056072-I       ENSG00000184319 RPL23AP82
ENST00000494075.1056073-I       ENSG00000184319 RPL23AP82
ENST00000494075.1056074-I       ENSG00000184319 RPL23AP82
ENST00000423888.1056075-I       ENSG00000184319 RPL23AP82
ENST00000423888.1056076-I       ENSG00000184319 RPL23AP82
ENST00000423888.1056077-I       ENSG00000184319 RPL23AP82
ENST00000480246.1056078-I       ENSG00000184319 RPL23AP82
ENST00000480246.1056079-I       ENSG00000184319 RPL23AP82
lpantano commented 5 years ago

Thanks, so this command should be ok?:

bustools count -o genes/ --genecounts -g cDNA_introns.t2g.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

Munfred commented 5 years ago

Yes.

Just so you know, there is small issue right now in that bustools will not create the subfolder if you give it a multi folder output (e.g. if you want the output on ./sample2/genecounts) so you need to make sure the folder exist in that case. If you'd like to do that, this is how I'd make sure the folder gets created prior to running the command:

mkdir -p ./sample2/genecounts

bustools count \
--output  ./sample2/genecounts/genes \
--genecounts \
--genemap /references/homo_sapiens-ensembl-96/transcripts_to_genes.txt \
--ecmap ./sample2/matrix.ec \
--txnames  ./sample2/transcripts.txt \
./sample2/output.corrected.sorted.bus
lpantano commented 5 years ago

Thank you!

Sadly I am having a segmentation fault (with the conda and source code compiled by myself) only for that command, this is the debug output, I don't know if that helps.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x000000000043f9e0 in std::vector<int, std::allocator<int> >::begin (this=0x2aabd9af73f8) at /usr/include/c++/4.8.2/bits/stl_vector.h:548
548           { return const_iterator(this->_M_impl._M_start); }
(gdb) bt
#0  0x000000000043f9e0 in std::vector<int, std::allocator<int> >::begin (this=0x2aabd9af73f8) at /usr/include/c++/4.8.2/bits/stl_vector.h:548
#1  0x000000000045cb53 in intersect_genes_of_ecs (ecs=std::vector of length 1, capacity 100 = {...}, ec2genes=std::vector of length 12317869, capacity 16777216 = {...},
    glist=std::vector of length 0, capacity 100) at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/Common.cpp:174
#2  0x0000000000450ddb in __lambda1::operator() (__closure=0x7fffffffbdb0, v=std::vector of length 6345915, capacity 6400000 = {...})
    at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/bustools_count.cpp:177
#3  0x0000000000451a87 in bustools_count (opt=...) at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/bustools_count.cpp:241
#4  0x000000000043dd35 in main (argc=12, argv=0x7fffffffd868) at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/bustools_main.cpp:1072

The output folder has an empty MTX file.

the other count commands work, so is not this step specifically.

I will try the test data to make sure is not my sample, but I really want to have this working with my data, let me know if I can give you any more information.

If it is the cpp library version, can you tell me the one you have that is working so I can try to compile versus that particular version?

thanks!

Munfred commented 5 years ago

Please make sure that all of the files you're giving bustools exist (that the path is correct) and that the bus file is not empty.

If the issue persists please try again with test data.

Every time we've seen seg faults its because either we were giving it empty files, or paths to files that didn't exist.

On Tue, Jul 23, 2019, 10:35 Lorena Pantano notifications@github.com wrote:

Thank you!

Sadly I am having a segmentation fault (with the conda and source code compiled by myself) only for that command, this is the debug output, I don't know if that helps.

[Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault. 0x000000000043f9e0 in std::vector<int, std::allocator >::begin (this=0x2aabd9af73f8) at /usr/include/c++/4.8.2/bits/stl_vector.h:548 548 { return const_iterator(this->_M_impl._M_start); } (gdb) bt

0 0x000000000043f9e0 in std::vector<int, std::allocator >::begin (this=0x2aabd9af73f8) at /usr/include/c++/4.8.2/bits/stl_vector.h:548

1 0x000000000045cb53 in intersect_genes_of_ecs (ecs=std::vector of length 1, capacity 100 = {...}, ec2genes=std::vector of length 12317869, capacity 16777216 = {...},

glist=std::vector of length 0, capacity 100) at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/Common.cpp:174

2 0x0000000000450ddb in lambda1::operator() (closure=0x7fffffffbdb0, v=std::vector of length 6345915, capacity 6400000 = {...})

at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/bustools_count.cpp:177

3 0x0000000000451a87 in bustools_count (opt=...) at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/bustools_count.cpp:241

4 0x000000000043dd35 in main (argc=12, argv=0x7fffffffd868) at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/bustools_main.cpp:1072

The output folder has an empty MTX file.

the other count commands work, so is not this step specifically.

I will try the test data to make sure is not my sample, but I really want to have this working with my data, let me know if I can give you any more information.

If it is the cpp library version, can you tell me the one you have that is working so I can try to compile versus that particular version?

thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BUStools/bustools/issues/11?email_source=notifications&email_token=AC7MY4FFG6TOVIQZLZMFIHLQA46MFA5CNFSM4IFLDGN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2T33AQ#issuecomment-514309506, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7MY4DY2GYRRHUCAWFHV73QA46MFANCNFSM4IFLDGNQ .

lpantano commented 5 years ago

Thank you.

files exist and they are not empty:

4.7G    matrix.ec
1.5G    output.correct.sort.bus
30M     transcripts.txt
58M  cDNA_introns.t2g.txt

output folder exists. this is the command I used:

bustools count -o genes/genes --genecounts -g cDNA_introns.t2g.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

it takes a while to give the error, like 20 min or so. There is an empty file at genes/genes.mtx.

I tried the with the example data: bus_output_06 and the same error appeared. I used only a pair of files when aligning to speed up computation:

06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R1_001.fastq.gz 06/10X_17_029_MissingLibrary_1_HL
73JBCXY/bamtofastq_S1_L002_R2_001.fastq.gz

I tried in a different cluster machines, and the same seg.fault appeared:

Program received signal SIGSEGV, Segmentation fault.
0x000055555557682b in intersect_genes_of_ecs(std::vector<int, std::allocator<int> > const&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<int, std::allocator<int> >&) ()

I know it is a difficult error to catch up, I am happy to help in any way possible.

cheers

Munfred commented 5 years ago

If you can provide all files and the bustools binary you are using we will try to reproduce it and identify a fix

On Wed, 24 Jul 2019 at 08:25, Lorena Pantano notifications@github.com wrote:

Thank you.

files exist and they are not empty:

4.7G matrix.ec 1.5G output.correct.sort.bus 30M transcripts.txt 58M cDNA_introns.t2g.txt

output folder exists. this is the command I used:

bustools count -o genes/genes --genecounts -g cDNA_introns.t2g.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

it takes a while to give the error, like 20 min or so. There is an empty file at genes/genes.mtx.

I tried the with the example data: bus_output_06 and the same error appeared. I used only a pair of files when aligning to speed up computation:

06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R1_001.fastq.gz 06/10X_17_029_MissingLibrary_1_HL 73JBCXY/bamtofastq_S1_L002_R2_001.fastq.gz

I tried in a different cluster machines, and the same seg.fault appeared:

Program received signal SIGSEGV, Segmentation fault. 0x000055555557682b in intersect_genes_of_ecs(std::vector<int, std::allocator > const&, std::vector<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > > const&, std::vector<int, std::allocator >&) ()

I know it is a difficult error to catch up, I am happy to help in any way possible.

cheers

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BUStools/bustools/issues/11?email_source=notifications&email_token=AC7MY4BJM7TATBH7JEQLIUTQBBX5XA5CNFSM4IFLDGN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2WWHRI#issuecomment-514679749, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7MY4AWYB3XD3GP4QWLITDQBBX5XANCNFSM4IFLDGNQ .

lpantano commented 5 years ago

Thank you! Here are the files:

wget https://lpantano-debug.s3.amazonaws.com/bustools/bustools
wget https://lpantano-debug.s3.amazonaws.com/bustools/cDNA_introns.t2g.txt
wget https://lpantano-debug.s3.amazonaws.com/bustools/matrix.ec
wget https://lpantano-debug.s3.amazonaws.com/bustools/output.correct.sort.bus
wget https://lpantano-debug.s3.amazonaws.com/bustools/transcripts.txt

I installed bustools with conda.

thanks so much

Munfred commented 5 years ago

How did you generate this bus file? I converted the 2.6GB bus file you provided into text to inspect it and it expanded to 205GB.

Each entry that should correspond to a UMI has a length of 2573 charachters, and it repeats every 32 characters. I have attached the first 5 lines of the file for you to understand.

If you can provide the detailed steps you used to generate this file we can look further into the issue.

nuclear_rna_first_5_lines.txt

lpantano commented 5 years ago

This is for the the velocity tutorial, I just took on of the samples:


kallisto bus -i cDNA_introns.idx -o bus_output_06/ -x 10xv2 -t 4
06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R1_001.fastq.gz
06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R2_001.fastq.gz

bustools correct -w ../../10xv2_whitelist.txt -p output.bus | bustools sort
-o output.correct.sort.bus -t 4 -

Is this the information you need? I just followed the steps in the tutorial.

Thanks for helping with this.

On July 26, 2019 at 1:10:45 AM, Eduardo Beltrame (notifications@github.com) wrote:

How did you generate this bus file? I converted the 2.6GB bus file you provided into text to inspect it and it expanded to 205GB.

Each entry that should correspond to a UMI has a length of 2573 charachters, and it repeats every 32 characters. I have attached the first 5 lines of the file for you to understand.

If you can provide the detailed steps you used to generate this file we can look further into the issue.

nuclear_rna_first_5_lines.txt https://github.com/BUStools/bustools/files/3434254/nuclear_rna_first_5_lines.txt

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/BUStools/bustools/issues/11?email_source=notifications&email_token=AAML6HHNZUSIGRUK2DXKBALQBKBNLA5CNFSM4IFLDGN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD23QNCI#issuecomment-515311241, or mute the thread https://github.com/notifications/unsubscribe-auth/AAML6HAZCKJOS4D5TBC546TQBKBNLANCNFSM4IFLDGNQ .

lpantano commented 5 years ago

The fastq files were done by:


bamtofastq --reads-per-fastq=500000000 10X_17_029.bam ./06

Can I convert the BUS file to text? I have my sample as well and I could see if that happens with my sample as well. Or I can give it to your.

Thanks

On July 26, 2019 at 9:00:29 AM, Lorena Pantano (lorena.pantano@gmail.com) wrote:

This is for the the velocity tutorial, I just took on of the samples:


kallisto bus -i cDNA_introns.idx -o bus_output_06/ -x 10xv2 -t 4
06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R1_001.fastq.gz
06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R2_001.fastq.gz

bustools correct -w ../../10xv2_whitelist.txt -p output.bus | bustools sort
-o output.correct.sort.bus -t 4 -

Is this the information you need? I just followed the steps in the tutorial.

Thanks for helping with this.

On July 26, 2019 at 1:10:45 AM, Eduardo Beltrame (notifications@github.com) wrote:

How did you generate this bus file? I converted the 2.6GB bus file you provided into text to inspect it and it expanded to 205GB.

Each entry that should correspond to a UMI has a length of 2573 charachters, and it repeats every 32 characters. I have attached the first 5 lines of the file for you to understand.

If you can provide the detailed steps you used to generate this file we can look further into the issue.

nuclear_rna_first_5_lines.txt https://github.com/BUStools/bustools/files/3434254/nuclear_rna_first_5_lines.txt

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/BUStools/bustools/issues/11?email_source=notifications&email_token=AAML6HHNZUSIGRUK2DXKBALQBKBNLA5CNFSM4IFLDGN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD23QNCI#issuecomment-515311241, or mute the thread https://github.com/notifications/unsubscribe-auth/AAML6HAZCKJOS4D5TBC546TQBKBNLANCNFSM4IFLDGNQ .

Munfred commented 5 years ago

Yes you can convert it to text by doing bustools text -o bus_output_filename.txt bus_input_file.bus

On Fri, 26 Jul 2019 at 06:07, Lorena Pantano notifications@github.com wrote:

The fastq files were done by:


bamtofastq --reads-per-fastq=500000000 10X_17_029.bam ./06

Can I convert the BUS file to text? I have my sample as well and I could see if that happens with my sample as well. Or I can give it to your.

Thanks

On July 26, 2019 at 9:00:29 AM, Lorena Pantano (lorena.pantano@gmail.com) wrote:

This is for the the velocity tutorial, I just took on of the samples:


kallisto bus -i cDNA_introns.idx -o bus_output_06/ -x 10xv2 -t 4
06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R1_001.fastq.gz
06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R2_001.fastq.gz

bustools correct -w ../../10xv2_whitelist.txt -p output.bus | bustools sort
-o output.correct.sort.bus -t 4 -

Is this the information you need? I just followed the steps in the tutorial.

Thanks for helping with this.

On July 26, 2019 at 1:10:45 AM, Eduardo Beltrame (notifications@github.com ) wrote:

How did you generate this bus file? I converted the 2.6GB bus file you provided into text to inspect it and it expanded to 205GB.

Each entry that should correspond to a UMI has a length of 2573 charachters, and it repeats every 32 characters. I have attached the first 5 lines of the file for you to understand.

If you can provide the detailed steps you used to generate this file we can look further into the issue.

nuclear_rna_first_5_lines.txt < https://github.com/BUStools/bustools/files/3434254/nuclear_rna_first_5_lines.txt

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub < https://github.com/BUStools/bustools/issues/11?email_source=notifications&email_token=AAML6HHNZUSIGRUK2DXKBALQBKBNLA5CNFSM4IFLDGN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD23QNCI#issuecomment-515311241

, or mute the thread < https://github.com/notifications/unsubscribe-auth/AAML6HAZCKJOS4D5TBC546TQBKBNLANCNFSM4IFLDGNQ

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BUStools/bustools/issues/11?email_source=notifications&email_token=AC7MY4GLJ364NNXCRVAZGV3QBLZH7A5CNFSM4IFLDGN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD24QZHQ#issuecomment-515443870, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7MY4FGUUXK7SSKSJUVNUTQBLZH7ANCNFSM4IFLDGNQ .

lpantano commented 5 years ago

yes, the bus file from my sample looks the same than that. Used the same command I posted before to produce the bus file, with kallisto.

Munfred commented 5 years ago

Ok, can you confirm that the fastq files you generated using bamtofastq look correct? If you can paste here the output of head it would be good.

If the fastqs look correct then it might have been an issue with the files given to kallisto, if you can also post the commands you use to call kallisto bus and the files used that might solve the mystery

On Fri, 26 Jul 2019 at 10:57, Lorena Pantano notifications@github.com wrote:

yes, the bus file from my sample looks the same than that. Used the same command I posted before to produce the bus file, with kallisto.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BUStools/bustools/issues/11?email_source=notifications&email_token=AC7MY4F7TQT46GJOG52FOSTQBM3H7A5CNFSM4IFLDGN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD25JQGI#issuecomment-515545113, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7MY4CBIETH6LOXRGEKXA3QBM3H7ANCNFSM4IFLDGNQ .

lpantano commented 5 years ago

Sure.

These are the commands to create the bus file from the bam files:

bamtofastq --reads-per-fastq=500000000 10X_17_029.bam ./06

kallisto bus -i cDNA_introns.idx -o bus_output_06/ -x 10xv2 -t 4 06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R1_001.fastq.gz 06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R2_001.fastq.gz

bustools correct -w ../../10xv2_whitelist.txt -p output.bus | bustools sort -o output.correct.sort.bus -t 4 -

The other sample I have is not generated with bamtofastq, they were generated with cellranger mkfasta and the bus file is the same.

R1:

@D00456:228:HL73JBCXY:2:1111:15821:53600 1:N:0:0
CTAGCCCTAACCCTAACCCTAACCCT
+
.GAGA<GGGGAGGGGGGGGAAA.A<G
@D00456:228:HL73JBCXY:2:1213:6403:92501 1:N:0:0
AACCCTAACCCTAACCCTAACCCTAA
+
GGGGAGGGGGAGGAGGGGGG..AGAG
@D00456:228:HL73JBCXY:2:2106:9266:5974 1:N:0:0
ACGTCAAAGCAGCCTCCACAATGTGG

R2:

@D00456:228:HL73JBCXY:2:1111:15821:53600 3:N:0:0
GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGCTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGGATAGGGTTCGGTCTAGGGATAGG
+
AGGAAA..GG.<<.G<A.G<G..<G<G<GAGG..<AGAG<..<<A..<<GG.<<GGA<..<<<..<...<..<<.<<AA...<...<.<...<<....
@D00456:228:HL73JBCXY:2:1213:6403:92501 3:N:0:0
GGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGAGTAGGGATGGGGAGCGGGGAGGGGAGAGG
+
GA<G.AAA<GGGAG<AAGGGI<..<GGGGGAAAG..<.<.<.<AGAG.GGG.<.GAG..G<<<...<<....<<......<.<.<.......<..<..
@D00456:228:HL73JBCXY:2:2106:9266:5974 3:N:0:0
GCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAAGGCTGACGGCAAGTTAACGAAAAGAAAAATGGTGAATGATACCCGGTGC

In a last note, the velocity tutorial works if I use bustools capture and bustools count to get the two type of matrices. This is only happening when I want just to do bustools count with the matrix.ec file firectly since I want to quantify all reads into genes. (https://www.kallistobus.tools/velocity_tutorial.html)

thanks!

Munfred commented 5 years ago

Thanks, this all looks good. One last question - are you by any chance running this on a 32 bit machine?

On Fri, Jul 26, 2019, 13:16 Lorena Pantano notifications@github.com wrote:

Sure.

These are the commands to create the bus file from the bam files:

bamtofastq --reads-per-fastq=500000000 10X_17_029.bam ./06

kallisto bus -i cDNA_introns.idx -o bus_output_06/ -x 10xv2 -t 4 06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R1_001.fastq.gz 06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R2_001.fastq.gz

bustools correct -w ../../10xv2_whitelist.txt -p output.bus | bustools sort -o output.correct.sort.bus -t 4 -

The other sample I have is not generated with bamtofastq, they were generated with cellranger mkfasta and the bus file is the same.

R1:

@D00456:228:HL73JBCXY:2:1111:15821:53600 1:N:0:0 CTAGCCCTAACCCTAACCCTAACCCT + .GAGA<GGGGAGGGGGGGGAAA.A<G @D00456:228:HL73JBCXY:2:1213:6403:92501 1:N:0:0 AACCCTAACCCTAACCCTAACCCTAA + GGGGAGGGGGAGGAGGGGGG..AGAG @D00456:228:HL73JBCXY:2:2106:9266:5974 1:N:0:0 ACGTCAAAGCAGCCTCCACAATGTGG

R2:

@D00456:228:HL73JBCXY:2:1111:15821:53600 3:N:0:0 GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGCTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGGATAGGGTTCGGTCTAGGGATAGG + AGGAAA..GG.<<.G<A.G<G..<G<G<GAGG..<AGAG<..<<A..<<GG.<<GGA<..<<<..<...<..<<.<<AA...<...<.<...<<.... @D00456:228:HL73JBCXY:2:1213:6403:92501 3:N:0:0 GGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGAGTAGGGATGGGGAGCGGGGAGGGGAGAGG + GA<G.AAA<GGGAG<AAGGGI<..<GGGGGAAAG..<.<.<.<AGAG.GGG.<.GAG..G<<<...<<....<<......<.<.<.......<..<.. @D00456:228:HL73JBCXY:2:2106:9266:5974 3:N:0:0 GCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAAGGCTGACGGCAAGTTAACGAAAAGAAAAATGGTGAATGATACCCGGTGC

In a last note, the velocity tutorial works if I use bustools capture and bustools count to get the two type of matrices. This is only happening when I want just to do bustools count with the matrix.ec file firectly since I want to quantify all reads into genes. ( https://www.kallistobus.tools/velocity_tutorial.html)

thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BUStools/bustools/issues/11?email_source=notifications&email_token=AC7MY4BGCD4VK3OFTQIRICLQBNLRDA5CNFSM4IFLDGN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD25TMEY#issuecomment-515585555, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7MY4EK3CRRD5ULBB7ELXDQBNLRDANCNFSM4IFLDGNQ .

lpantano commented 5 years ago

thank you for confirming.

nope, 64 linux machines.

thanks!

Munfred commented 5 years ago

Ok we're very puzzled as to how the bus file with the 2573 characters in the UMI have been made, and ran out of ideas.

Can you subsample 100k reads from R1 and R2 so that we can try to reproduce the problematic bus file? seqtk is an easy way to subsample them: https://github.com/lh3/seqtk

lpantano commented 5 years ago

Thanks. I’ll do that on Monday.

I checked the BUS file from kallisto and looks like this:

CTGTTTACATACGCTA TAATAAGGAA 1746067 1 CTTAACTAGCTCTCGG ATCGTCCTGG 1746067 1 CTTTGCGAGCGATGAC TCCAGGGTGA 1746067 1 CTTTGCGGTACTCGCG TTAAATCGTG 1746067 1 CTTTGCGGTGTTCGAT CCCGCACGGT 1746067 1 CTTTGCGGTGTTCGAT CCCGCACGGT 1746067 1 GAAGCAGAGGCTCATT TGTTTTGCAC 1746067 1 GAAGCAGAGGCTCATT GTCTTTGATG 1746067 1 GAATGAACATGCTAGT AAATGAACAA 1746067 1 GACAGAGCAATCGAAA CACGGATGTC 1746067 1

But after bustools correct and sort it looks like this (it seems the problem is the correct | sort step):

AAAAAAAAAAAAAAAA AAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAA AAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAA AAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACC AAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAA AAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAA CAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAA AAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAA AAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAA AACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAA AAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAA AAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACC AAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAA AAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCAAAAAAAAAAAAAAAAAAAAAAAACAAAAACCA 0 3275751424

lpantano commented 4 years ago

Hi,

I uploaded the BUS file before sorting and correction that it seems right. Do you still want the FASTQ files?

https://lpantano-debug.s3.amazonaws.com/bustools/output.bus

As I posted last week, the weird bus appears after the correct command.

Thanks!